ABSTRACT


Modeling and simulation of military operations requires human behavior models capable 
of learning from experience in complex environments where feedback on action quality is 
noisy and delayed. This research examines the potential of reinforcement learning, a class 
of AI learning algorithms, to address this need. A novel reinforcement learning algorithm 
that uses the exponentially weighted average reward as an action-value estimator is 
described. Empirical results indicate that this relatively straightforward approach improves 
learning speed in both benchmark environments and challenging applied settings. 
Applications of reinforcement learning to verifying the reward structure of a training 
simulation, improving the performance of a discrete event simulation scheduling 
tool, and enabling adaptive decision-making in combat simulation are presented. To 
place reinforcement learning within the context of broader models of human information 
processing, a practical cognitive architecture is developed and applied to the representation 
of a population within a conflict area. These varied applications and domains demonstrate 
that the potential for the use of reinforcement learning within modeling and simulation is 
great. 

I. INTRODUCTION 


Modeling and simulation of military operations requires human behavior models 
capable of learning from experience in complex environments where feedback on action 
quality is noisy and delayed. This dissertation examines the potential of reinforcement 
learning, a class of learning algorithms from artificial intelligence, to address this need. 
Reinforcement learning provides an empirically developed conceptual model of human 
behavior to guide the implementation of a model for human decision making within 
simulation models (Lattal, 2010; Thorndike, 1911). Cognitive architectures are 
simulation-oriented models of individual human information processing and behavior, often 
developed to emphasize specific aspects of cognition, depending on the use case (Langley, 
Laird, & Rogers, 2009; Zacharias, MacMillan, & Van Hemel, 2008). Autonomous agents are 
software agents designed to sense their environment and select appropriate actions to express 
in the environment based on their perception of the state of the environment (Russell & 
Norvig, 2010). A simulation model is a representation of those elements of a real-world 
phenomenon required to support either the issues for analysis or the training objectives of 
the user, depending on the specified use case (Goerger, McGinnis, & Darken, 2005). 

A. MOTIVATING PROBLEM 

The United States Department of Defense (DoD) relies on simulation models to 
inform decision-making processes for contingency planning and acquisition (Goerger et 
al., 2005; DMSO, 2004; Zacharias et al., 2008). Simulation models support analyses that 
provide insights into the combat effectiveness of weapon systems, the appropriate mix of 
capabilities, and the impact of tactics, techniques, and procedures (TTP) that would not 
otherwise be obtainable since, in many cases, physical experiments are infeasible or too 
costly. Training simulations serve to provide practice environments in which trainees can 
develop skill proficiency at relatively low cost and risk. The human decision maker is 
represented in combat simulations with varying degrees of sophistication depending on the 
model (Zacharias et al., 2008). In many cases, agent decision-making is limited to a small 
set of reflexive rules, which limits the flexibility of these models and increases scenario 
development time and cost required to support analysis. 

Many recent DoD studies center on capabilities designed to improve the quality 
and quantity of information provided to a decision maker (Zacharias et al., 2008). The 
analysis of the impact of information on combat operations requires an agent that, when 
presented with new or different sets of information from the environment, can change its 
behavior, choosing a different action based on feedback from the environment. DoD 
training simulations require adaptive agents capable of providing trainees with realistic training 
experiences by serving as autonomous agents within the training environment, as well as 
software agents that can adaptively adjust the training environment in order to improve the 
trainee's experience. DoD combat simulations need agents capable of learning from their 
experience in the environment and choosing different actions in response to different 
situations (Zacharias et al., 2008). The DoD analytic community also requires simulation 
agents whose decision-making is based on an empirically validated conceptual model and 
whose resultant behavior is explainable and traceable (J. Alt, Lieberman, & Blais, 2010; 
DMSO, 2004; Goerger et al., 2005; Zacharias et al., 2008). Reinforcement learning (RL) 
algorithms coupled with a model of human information processing, such as a practical 
cognitive architecture, provide one potential approach to address this need (J. K. Alt, Baez, & 
Darken, 2011; Anderson & Schunn, 2005; Zacharias et al., 2008; N. Taatgen, Lebiere, & 
Anderson, 2006). 

The temporal credit assignment problem in RL refers to the challenge of assigning 
credit or blame to actions that lead to the achievement of a goal in sequential task settings. 
Complex tasks often require numerous steps to achieve a goal, and as an individual learns 
the correct sequence of steps the challenge becomes understanding which decisions should 
be reinforced to enable a shorter path in the next iteration. Various strategies exist for 
addressing this challenge, but it is still an open area of research within the RL community. 
The RL algorithm presented in this dissertation results in improved learning performance 
over similar algorithms through its relatively straightforward, but novel, approach to the 
temporal credit assignment problem. This dissertation contributes insights into the 
application of RL to four use cases within military modeling and simulation. 

B. DEFINITIONS AND TERMINOLOGY 

This section provides a brief description of important definitions and terminology 
used in this dissertation. A software agent consists of a set of sensors, actuators, and an 
internal decision model (Russell & Norvig, 2010). The agent receives information 
regarding its environment in the form of percepts detected by its sensors. These percepts provide 
atomic-level information used by the agent to form a notion of state, which the agent uses 
to select an action to be expressed in the environment through its actuators (see Figure 1; 
Russell & Norvig, 2010). 



Figure 1: Agent environment interaction. 


Assume that a software agent is placed in a sequential decision problem in a discrete 
state space, $s \in S$, and must choose from a finite set of actions, $a \in A$, so as to maximize its 
long-term reward, V(s), defined as the sum of rewards provided from its reward function, 
which maps states to point rewards, $R(s) \to r$. If we assume the agent has access to a 
transition matrix, $P^{a}_{s,s'}$, describing the probability of transitioning from s given a to each 
of the possible states, s', then the environment is a fully observable environment (Russell 
& Norvig, 2010; Sutton & Barto, 1998). Further, this sequential decision problem is called 
a Markov decision process (MDP) since it adheres to the Markov property (see Figure 2). 



Figure 2: Simple two-state MDP. 
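The state set, action set, transition matrix, and reward function described above can be written down concretely. The following is a minimal sketch of a two-state MDP. The state names, action names, probabilities, and rewards are hypothetical values chosen for illustration and are not taken from Figure 2.

```python
# Hypothetical two-state MDP. Every name and number here is an illustrative
# assumption; none of it is taken from the dissertation's Figure 2.
STATES = ["s0", "s1"]
ACTIONS = ["stay", "move"]

# P[s][a][s_next] is the probability of reaching s_next from s under action a.
P = {
    "s0": {"stay": {"s0": 0.9, "s1": 0.1}, "move": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s0": 0.1, "s1": 0.9}, "move": {"s0": 0.8, "s1": 0.2}},
}

# R[s] is the point reward for occupying state s, matching R(s) -> r.
R = {"s0": 0.0, "s1": 1.0}


def is_valid_mdp(states, actions, transitions, tol=1e-9):
    """Check that every (s, a) row is a probability distribution over states."""
    for s in states:
        for a in actions:
            row = transitions[s][a]
            if set(row) != set(states):
                return False
            if any(p < 0.0 for p in row.values()):
                return False
            if abs(sum(row.values()) - 1.0) > tol:
                return False
    return True
```

Storing the transition matrix as nested dictionaries keeps the indexing close to the notation in the text: `P[s][a][s_next]` reads as the probability of moving from s to s' under action a.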


An MDP can be fully described by the tuple $(S, A, P^{a}_{s,s'}, R)$. A solution to an 
MDP, referred to as a policy, $\pi$, maps states to actions, $\pi(s) \to a$. If we assume that the 
preferences of the agent are stationary over time, then the accrual of long-term reward can 
be determined by a value function, V(s), which estimates the long-term value of a state 
using either an additive or discounted sum, where $\gamma \in [0,1)$ is known as the discount factor 
and t is the trial count (Powell, 2011; Russell & Norvig, 2010; Sutton & Barto, 1998). 

As $\gamma \to 0$, the agent places more weight on near-term rewards, and as $\gamma \to 1$ the 
discounted utility approaches the additive case (Kaelbling, Littman, & Moore, 1996; 
Powell, 2011; Russell & Norvig, 2010). Each sequence of states is considered a unique 
policy, $\pi$. A policy is said to be stationary if it is deterministic and the action it chooses 
depends only on the current state, s (Ross, 1982). Since this satisfies the Markov property, 
the sequence of states forms a Markov chain. Since the state transitions are Markovian, it 
is appropriate to think of the value of a state, V(s), in terms of expectations and the policy 
as a mapping to the probability of taking action a in state s. We will consider only the 
discounted case going forward and can state that the expected value of executing a policy, 
$\pi$, starting in state s is, 


$$V^{\pi}(s) = E_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \mid s_{t} = s \right]. \tag{2}$$
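The discounted sum inside the expectation of equation (2) can be computed directly for a single observed reward sequence. The sketch below assumes a finite list of rewards; the function name `discounted_return` is hypothetical. Averaging this quantity over many sampled sequences that start in state s would approximate the expectation.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * r_t for one finite reward sequence.

    gamma = 1.0 is accepted here because a finite sequence has a finite
    additive sum; the infinite-horizon case requires gamma < 1.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total
```

With `gamma = 0.0` only the first reward counts, and with `gamma = 1.0` the result is the plain additive sum, which mirrors the two limiting cases described in the text.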

The optimal state-value function gives the long-term reward the agent 
will gain if it starts in that state and executes the optimal policy, $\pi^{*}$ (Kaelbling et al., 1996), 

$$V^{*}(s) = \max_{\pi} E\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \right]. \tag{3}$$

The optimal value function is uniquely defined and can be obtained through the solution 
of a set of simultaneous equations, referred to as the Bellman equations (Kaelbling et al., 
1996), 


$$V^{*}(s) = \max_{a} \left\{ R(s) + \gamma \sum_{s'} P^{a}_{s,s'} V^{*}(s') \right\}, \quad \forall s \in S. \tag{4}$$

We can now determine an optimal policy, $\pi^{*}$, that yields the greatest expected 
long-term reward for a sequence of decisions (Kaelbling et al., 1996), 

$$\pi^{*}(s) = \arg\max_{a} \left\{ R(s) + \gamma \sum_{s'} P^{a}_{s,s'} V^{*}(s') \right\}. \tag{5}$$

Assume that the agent has a fixed time, $t \in T$, over which to maximize its 
long-term reward, a finite horizon; then $\pi^{*}$ is said to be nonstationary, since it is a function of T 
(Powell, 2011; Russell & Norvig, 2010). If this time constraint is relaxed and the agent has 
an infinite amount of time over which to act, an infinite horizon, the optimal action depends 
only on the current state, s, and the optimal policy, $\pi^{*}$, becomes stationary. 

The value-iteration algorithm, a technique from the field of dynamic programming, 
provides a means of solving the Bellman equations. It rests on the statement that the utility 
of a state is equal to the immediate reward from the current state, s, plus the expected 
discounted utility of the next state, s', assuming that the agent chooses the optimal action 
available (Kaelbling et al., 1996; Ross, 1982; Russell & Norvig, 2010; Sutton & Barto, 1998), 

$$V_{n}(s) \leftarrow R(s) + \gamma \max_{a} \sum_{s'} P^{a}_{s,s'} V_{n-1}(s'). \tag{6}$$

This iterative form of the algorithm will typically proceed until it reaches an equilibrium 
based on the maximum error allowed in the approximation of the value of a state, V(s). 
Value-iteration serves as a basis for many of the RL algorithms. It also serves as the 
motivation for policy iteration, a technique to iteratively improve the performance of a policy 
by solving the analytic version of the Bellman equation and the action-values, Q(s, a), 
associated with a given state, s (Dimitrakakis & Lagoudakis, 2008; Sutton & Barto, 1998), 

$$Q^{\pi}(s,a) = R(s) + \gamma \sum_{s'} P^{a}_{s,s'} V^{\pi}(s'). \tag{7}$$

The action-value, Q(s, a), on which we will focus much of our research, can then 
become the basis for determining the new policy, $\pi$, 

$$\pi_{i+1}(s) = \arg\max_{a} Q^{\pi_{i}}(s,a). \tag{8}$$
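The backups in equations (6) through (8) can be sketched in a few lines of code. The example below is illustrative only: it assumes a tabular MDP stored in dictionaries, a hypothetical two-state problem with made-up numbers, and a stopping tolerance `tol` standing in for the maximum allowed error. Here the action-values of equation (7) are computed from the converged values rather than from the values of an intermediate policy.

```python
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Repeat the backup of equation (6) until the largest change is below tol."""
    V = {s: 0.0 for s in states}
    for _ in range(max_iters):
        V_next = {
            s: R[s] + gamma * max(
                sum(P[s][a][s2] * V[s2] for s2 in states) for a in actions
            )
            for s in states
        }
        max_change = max(abs(V_next[s] - V[s]) for s in states)
        V = V_next
        if max_change < tol:
            break
    return V


def action_values(states, actions, P, R, V, gamma=0.9):
    """Equation (7) applied to V: Q(s, a) = R(s) + gamma * sum_s' P * V(s')."""
    return {
        (s, a): R[s] + gamma * sum(P[s][a][s2] * V[s2] for s2 in states)
        for s in states
        for a in actions
    }


def greedy_policy(states, actions, Q):
    """Equation (8): pick the action with the largest action-value per state."""
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}


# Hypothetical two-state problem; the numbers are illustrative assumptions.
STATES = ["s0", "s1"]
ACTIONS = ["stay", "move"]
P = {
    "s0": {"stay": {"s0": 0.9, "s1": 0.1}, "move": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s0": 0.1, "s1": 0.9}, "move": {"s0": 0.8, "s1": 0.2}},
}
R = {"s0": 0.0, "s1": 1.0}

V = value_iteration(STATES, ACTIONS, P, R)
Q = action_values(STATES, ACTIONS, P, R, V)
policy = greedy_policy(STATES, ACTIONS, Q)
```

In this toy problem only state `s1` is rewarded, so the greedy policy moves out of `s0` and stays in `s1`. Note that every step relies on the transition matrix, which is exactly the knowledge that model-free methods do without.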

Knowledge of the transition probabilities, $P^{a}_{s,s'}$, is seldom available in real-world 
applications or to a human decision maker attempting to solve a complex MDP. The ability to 
learn the transition probabilities, $P^{a}_{s,s'}$, over time, as is the case in model-based RL, has its 
own limitations due to the potentially large number of states and actions and the need to 
maximize utility while learning. Model-free methods of RL provide a useful alternative to 
determine the optimal policy, $\pi^{*}$, within sequential decision problems. 

If we assume that the agent has a noisy sensor that affects its perception of 
information from the environment, then the environment is no longer fully observable. The 
MDP now becomes a partially observable Markov decision process (POMDP) (Russell & 
Norvig, 2010). The elements of the MDP are still present, but now a sensor model is 
introduced. This noisy sensor and its impact on agent perception distinguish POMDPs from 
MDPs. As a result, the agent looks for the optimal policy based on its current perceived 
state rather than its actual state. RL techniques provide a feasible means of determining the 
optimal policy within a POMDP, but we will continue discussion of the RL problem in the 
context of an MDP, which is often used in benchmarking the performance of RL systems (see 
Figure 3; Sutton & Barto, 1998). 

Figure 3: Agent-environment interaction in reinforcement learning. 


An RL problem is a form of goal-directed learning from interaction, or trial and 
error, where the learner, an agent, tries to learn what actions to take to maximize the sum of 
a numeric reward signal, which might be immediate or delayed (Powell, 2011; Russell & 
Norvig, 2010; Sutton & Barto, 1998). RL problem formulations include three 
characteristics: sensation, action, and goal. An agent requires information regarding the state of the 
environment, the relevant action to affect the state, and a reward function mapping states to 
goals (Sutton & Barto, 1998). 

An RL problem in which the next state, s', depends only on the current state, s, 
satisfies the Markov property and is an MDP. RL is not limited to MDPs, however, and can 
be successfully applied to non-Markov environments, though MDPs often serve as 
benchmarking environments for RL algorithms. RL tasks can be either episodic (finite horizon), 
if the task breaks into natural subsequences, referred to as episodes, or continuing 
(infinite horizon), if the task goes on without limit. In both cases the long-term cumulative 
reward is treated in either an additive or discounted manner as described previously. 
Episodes end in unique terminal states, such as when a goal is achieved. 

An important distinction between techniques used to solve MDPs and the RL 
problem is the degree of information available to the agent. The RL agent requires no a priori 
model of the environment or the reward function, R(s). Instead, it must learn through trial 
and error, as Russell and Norvig summarize nicely, 

Imagine playing a new game whose rules you don’t know; after a hundred 
or so moves, your opponent announces, “You lose.” This is RL in a nutshell. 
(Russell & Norvig, 2010) 

The core components of an RL system, as defined by Sutton and Barto, are a policy, $\pi(s)$, 
a reward function, R(s), and a value function, V(s), or the value of an action taken in a 
given state, Q(s, a) (Sutton & Barto, 1998). Some systems make use of an internal learned 
model of the environment, but here we scope our research to model-free methods. 

Model-free methods seek to estimate the value of a state, V(s), or the value of a 
state-action pair, Q(s, a), and have proven successful in a range of application areas across 
several disciplines (D. Bertsekas & Tsitsiklis, 1996; D. Bertsekas, 2011; Russell & Norvig, 
2010; Powell, 2011). 

Model-based approaches refer to the class of RL approaches that maintain an 
internal representation of the transition matrix describing their environment. These approaches 
attempt to learn the value of a state, V(s), and the transition probabilities, $P^{a}_{s,s'}$, which often 
results in lengthened learning times. The policy, $\pi(s)$, maps states to actions, $\pi(s) \to a$, 
as discussed previously. Reward functions provide the agent with a numeric signal when 
the goal has been obtained by mapping perceived states to a reward, $R(s) \to r$. The value 
function, V(s), or action-value function, Q(s, a), tracks the long-term value of a state or 
state-action pair, (s, a), using either additive or discounted methods. 

RL algorithms are often categorized by the method that they employ to form the 
estimate of the value of a state-action pair, (s, a), with the two main categories being 
Monte Carlo (MC) and temporal differencing (TD) methods. Both methods require 
samples of each state-action pair, (s, a), and the rewards, r, gained from experience with 
the environment. MC methods learn directly from the complete sampled returns, whereas TD 
methods form an expectation from the current value estimates (Szepesvari, 2010). 
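The difference between the two families can be seen in the target each one moves its estimate toward. The sketch below assumes the standard tabular forms of every-visit Monte Carlo and one-step TD for state values. The step size `alpha`, the function names, and the episode format (a list of `(state, reward)` pairs, where the reward is received on leaving that state) are assumptions made for illustration.

```python
def mc_update(V, episode, gamma, alpha):
    """Every-visit Monte Carlo: move V(s) toward the full observed return G.

    The update can only run once the episode has finished, because G depends
    on every reward that followed the visit to s.
    """
    G = 0.0
    # Walk backwards so G accumulates the discounted return from each state.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        old = V.get(state, 0.0)
        V[state] = old + alpha * (G - old)
    return V


def td0_update(V, state, reward, next_state, gamma, alpha, terminal=False):
    """One-step TD: move V(s) toward r + gamma * V(s'), an estimated return.

    The update runs after a single transition and bootstraps from the
    current estimate of the next state's value.
    """
    bootstrap = 0.0 if terminal else V.get(next_state, 0.0)
    old = V.get(state, 0.0)
    V[state] = old + alpha * (reward + gamma * bootstrap - old)
    return V
```

The Monte Carlo update waits for the end of the episode, while the TD update can be applied after every step, at the cost of relying on value estimates that may still be inaccurate.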
Two classic problems present themselves in RL: the exploration-exploitation dilemma 
and the temporal credit assignment problem. The exploration-exploitation dilemma refers 
to the problem of determining when to choose an action perceived as non-optimal. If an 
agent always chooses the greedy option, that is, it follows a stationary greedy policy, it risks 
missing out on a potentially higher reward from an untried state-action pair, (s, a). It must 
balance this against the need to maximize long-run expected utility. This problem occurs not 
only in RL, but in sequential decision making problems in general (Robbins, 1952; Ross, 
1982). We will discuss several non-stationary, stochastic policy strategies to address this problem. 
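One widely used stochastic strategy for balancing exploration and exploitation is epsilon-greedy action selection. It is shown here only as a common example and is not necessarily one of the strategies examined in later chapters. The function name, the `epsilon` parameter, and the dictionary of action-values are assumptions for illustration.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a uniformly random action with probability epsilon, else a greedy one.

    q_values maps each available action to its current value estimate.
    Ties among greedy actions are broken at random.
    """
    if rng.random() < epsilon:
        return rng.choice(list(q_values))
    best = max(q_values.values())
    return rng.choice([a for a, q in q_values.items() if q == best])
```

Setting `epsilon` to zero recovers the purely greedy policy described above. Reducing `epsilon` as experience accumulates is one way to make the policy non-stationary.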

The temporal credit assignment problem refers to the challenge of efficiently 
assigning rewards to state-action pairs when rewards are delayed (Szepesvari, 2010). This 
challenge appears in tasks requiring the completion of multiple steps prior to reaching the 
goal, such as in a path-finding task. How does one identify which actions attempted during 
the episode contributed to the achievement of the goal? 

Figure 4: Information processing model of human cognition. 


Cognitive architectures are simulation-oriented models of individual human 
information processing and behavior that emphasize specific aspects of cognition, depending on 
the use case (Langley et al., 2009; Zacharias et al., 2008). The National Research Council 
identifies cognitive architectures as relevant to three core areas of military 
modeling: analysis and forecasting in planning, simulation for training, and design and evaluation 
for acquisition (Zacharias et al., 2008). Cognitive architectures typically incorporate some 
derivative of the human information processing model (see Figure 4), which is minimally 
described as containing functions for perception, sense-making, meta-cognition, long-term 
memory, and action-selection (J. K. Alt et al., 2011; Klein, 1993; Wickens & Hollands, 
2000). The incorporation of RL within existing cognitive architectures, such as ACT-R 
and SOAR, has been positively received and has expanded the flexibility of these architectures, 
showing promise in developing adaptive agents that provide credible representations of 
human behavior within complex DoD simulation models (Laird, 2008; Laird & Wray III, 
2010). Within the DoD modeling, simulation, and analysis communities, models of 
decision making such as recognition-primed decision making (RPD) and Boyd’s observe, 
orient, decide, and act (OODA) loop have gained credibility as representing the essential 
functions of military decision making (Zacharias et al., 2008; Klein, 1993). 

One challenge in the use of existing cognitive architectures in military modeling 
and simulation is the data required to populate these models, whose intended use lies in 
understanding and representing human cognitive processing at a fine-grained 
level on specific tasks. The assumptions that must be accepted and the difficulty 
of collecting data to populate these models make them challenging to incorporate into DoD 
analytic models, which carry strict model and data validation requirements 
(Cassenti, 2009). A practical cognitive architecture that streamlines the data required while 
still meeting the functional requirements to support traceable autonomous agents within 
military modeling and simulation is required (J. K. Alt et al., 2011). 

C. CHALLENGES OF ADAPTIVE AGENT DECISION MAKING IN ANALYTIC 
AND TRAINING SIMULATIONS 

This section illuminates the challenges of using adaptive agents in simulation 
models intended for analytic and training purposes. Simulation model development and 
subsequent analysis of the simulation output cannot be treated as independent activities. The 
simulation model must support providing insights into the issues for analysis identified by 
the analyst. The analyst must develop the simulation model and situational vignettes that 
represent the measurement space required to inform the analysis question. Many combat 
models used by the analytic community, such as COMBATXXI, IWARS, and OneSAF, 
provide the user the ability to construct combat scenarios and assign behaviors to agents 
within the simulation (DA, 1999; DMSO, 2004; Goerger et al., 2005). These models 
represent the physics of combat well, but a common shortfall exists in the representation of 
human decision-makers. Each model provides a rule-based mechanism for controlling 
behavior, but suffers from a brittleness problem when the agent encounters situations that 
were not anticipated a priori. Agents designed for use in training simulations encounter 
the same challenges, but brittleness problems become more readily apparent due to the 
interaction with the trainee and can serve to distract the trainee from the intended training 
objectives. Other DoD models, such as the Cultural Geography (CG) model, are designed 
to represent aspects of the civilian population rather than combat forces. These agents 
represent not a single human, but a population segment. Similar limitations in the behavior of 
agents exist in this class of simulation model (J. Alt, Jackson, Hudak, & Lieberman, 2009). 

Adaptable agents, such as those based on RL, provide a reasonable option to avoid 
the brittleness problem and to potentially reduce the overhead associated with scenario 
development. They are also based on an empirically derived conceptual model that is readily 
explainable (Zacharias et al., 2008). Many issues arise in the use of adaptive agents within 
DoD simulation models intended for analysis, such as: (i) is the agent decision model 
based on a valid conceptual model of human decision-making?, (ii) what are the data 
requirements to populate the model and is it feasible to obtain the data from a valid source?, and 
(iii) are the results reproducible? These issues often serve as an impediment to the use of 
adaptive agents within analytic models, though there is general consensus that such agents 
would improve the efficiency of scenario development processes and the value of 
simulation analysis, particularly those analysis products intended to gauge the value of capabilities 
designed to enable greater access to information. This dissertation stresses the need to 
use explainable and practical approaches to the use of adaptive agents with these issues in 
mind. 

D. DISSERTATION CONTRIBUTIONS AND ORGANIZATION 

The primary contribution of this research is the novel use of the exponentially 
weighted average reward as an action-value estimator in reinforcement learning systems 
in order to address the temporal credit assignment problem. This 
relatively straightforward approach improves learning speed over dominant existing 
approaches in task environments with noisy and delayed reward signals and improves 
performance in non-stationary environments, while reducing the number of parameters required 
to be specified by the system designer from three, for current dominant approaches, to two. Many 
real-world applications fall into this category of problem, and in these cases delays in 
learning or recovery can result in control system failures and lost resources. The results provide 
the modeling and simulation community with a method that speeds learning in these 
challenging cases, reducing the time required to train autonomous software agents and the time 
required for agents to adjust to changes in their environment. These performance results 
carry over into each of the multiple modeling and simulation application areas examined in 
this research. 
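To give a sense of what an exponentially weighted average of rewards looks like as an estimate, the sketch below shows the generic textbook form of that average, kept per state-action pair. It is not the estimator developed in this dissertation, which is specified in Chapter III. The class name, the smoothing weight `alpha`, and the initial value are all assumptions made for illustration.

```python
class ExponentialAverageEstimator:
    """Generic exponentially weighted average of observed rewards per (s, a).

    Illustrative only: this is the standard moving-average form, not the
    dissertation's algorithm. alpha is an assumed smoothing weight; larger
    values weight recent rewards more heavily.
    """

    def __init__(self, alpha=0.1, initial=0.0):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must lie in (0, 1]")
        self.alpha = alpha
        self.initial = initial
        self.q = {}

    def update(self, state, action, reward):
        """Blend a new reward into the running estimate and return it."""
        key = (state, action)
        old = self.q.get(key, self.initial)
        self.q[key] = old + self.alpha * (reward - old)
        return self.q[key]

    def value(self, state, action):
        """Current estimate, or the initial value for an unseen pair."""
        return self.q.get((state, action), self.initial)
```

Because older rewards are discounted geometrically, an estimate of this kind tracks changes in the reward signal, which is one reason such averages are attractive in non-stationary settings.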

A second major contribution of this research is a novel methodology and example 
application of the use of reinforcement learning as a means of verifying the reward 
structure of a training simulation. The reward structure in a training simulation directly impacts 
trainee learning time and outcomes. A weak reward signal will result in slower learning, 
and a reward signal that rewards trainee actions that are not consistent with learning 
objectives will result in the trainee learning the wrong objectives. This research demonstrates 
the use of reinforcement learning to examine the reward structure and produce an example 
of the learned behavior, or policy, that can provide the training simulation designer 
feedback on the student behaviors rewarded by the training simulation before the simulation 
ever reaches student hands, allowing the developer the opportunity to identify and correct 
deficiencies prior to fielding. 

A third major contribution of this research is a methodology and application of 
reinforcement learning to address limitations of a discrete event simulation. This simulation 
is used to produce a feasible schedule for unmanned aerial assets that maximizes a value 
function by correctly pairing platforms with mission demands in the context of a combat 
scenario. The current approach employs a linear program that maximizes value over a 
finite time horizon, but fails to provide a feasible schedule that maximizes value in cases 
where high value targets appear beyond the time horizon or where high value emergent 
targets become available following the initial allocation. Further, this approach fails to learn 
to recognize the cues leading to these situations, as a human decision maker would over 
time, and makes these mistakes consistently, resulting in feasible, but non-value-maximizing, 
schedules. The value of the feasible schedule produced in different combat simulations 
for a given mix of unmanned platforms is used to inform acquisition decisions regarding 
unmanned assets, so the current tool's limitations directly impact the representation of the 
value of a given mix to senior decision makers. Since the value lost to 
these cases varies across mixes, the analyst cannot know how this systematic issue affects 
results in the aggregate. This research demonstrates the use of reinforcement learning to 
address these cases and an approach that relaxes the strict requirement for fully observable 
demands currently imposed on the simulation. 

A fourth major contribution of this research is a methodology and application of 
reinforcement learning to represent human decision-making within a combat simulation. This 
straightforward approach provides an empirically developed conceptual model of human 
decision making, important for eventual model validation, that facilitates dynamic decision 
making and allows agents to learn from interaction with their environment. This approach 
incorporates the novel use of reinforcement learning within hierarchical task networks, 
providing the potential to enable adaptive decision making within complex behaviors. This has 
particular relevance for enabling agents that adapt to the behavior of an opposing force, as 
human decision makers do, as opposed to the relatively brittle scripted methods currently in 
use. 

A fifth contribution of this research is the development and application of a novel 
practical cognitive architecture that facilitates the representation of human information 
processing and the inclusion of domain knowledge in a structured manner that enables the 
selective use of goal-driven reinforcement learning to represent human decision making. The 
cognitive architecture provides an understandable framework to incorporate the effects of 
perception, working memory, and dynamic goal-setting within simulation agents. This is 
particularly relevant for analysis topics related to the value of information or the impact 
of networked sensors. The cognitive architecture also has relevance to the representation 
of civilian behavior in conflict areas, where the analysis focuses on the beliefs and 
interests of a population and the cognitive architecture provides an organizing construct. This 
contribution was incorporated into a social simulation used to facilitate war-games that 
received a 2011 Army Modeling and Simulation Office award for excellence in analysis 
(http://www.ms.army.mil/about/awards/fyll-awards.html). 

The flow of the dissertation is as follows: Chapter II provides a review of 
reinforcement learning and cognitive architectures. Chapter III specifies and documents the 
performance of the novel use of an exponentially weighted average, which takes advantage 
of continuous time, as an action-value estimator in a reinforcement learning system in order 
to address the temporal credit assignment problem in an intuitive manner. Chapter IV 
details the application of RL to each of the applied military simulation use cases. Chapter V 
details the development and use of the cognitive architecture within a military social 
simulation. Chapter VI provides a summary of contributions and describes the path forward for 
future research. 

II. REINFORCEMENT LEARNING AND COGNITIVE 
ARCHITECTURES FOR AUTONOMOUS AGENT 
DECISION-MAKING 


When an organism acts upon the environment in which it lives, it changes that 
environment in ways which often affect the organism itself. Some of these 
changes are what the layman calls rewards, or what are now generally referred 
to technically as reinforcers: when they follow behavior in this way, they 
increase the likelihood that the organism will behave in the same way again. 
(Ferster & Skinner, 1957) 

This chapter contains the foundations for the subsequent development of a new action-value 
estimator, describes measures by which the performance of RL algorithms is typically 
judged, and provides foundational material for the subsequent development of a practical 
situation-based cognitive architecture. 

The simulations that DoD analysts use to inform decisions are typically large and 
complex, with a large potential state space (Cioppa, Lucas, & Sanchez, 2004). The agents 
within these models typically make decisions based on simple decision rules or custom 
scripting for each scenario, making code or behavior reuse difficult. These hard decision 
rules lock agents into set TTPs for a given decision situation. This makes the exploration 
of new capabilities, which potentially change TTPs, difficult and time-consuming. These 
relatively brittle decision mechanisms also preclude the representation of differing levels 
of experience and training present in real human decision makers (Zacharias et al., 2008). 
Adaptive agents are required to provide simulation models capable of representing the 
nuances of human decision makers in a more credible manner. Adaptive agents are needed 
that possess the following characteristics: 

• Empirically derived conceptual model of decision-making. 

• Capable of adapting behavior gracefully when environment changes. 

• Learn appropriate behavior in minimum time and with minimum error.

• Require few tunable parameters and initialization data.


• Behavioral changes occur in a transparent and explainable manner. 

A variety of approaches for employing adaptive software agents exist in the literature,
but few have been explored within DoD simulation models (Zacharias et al., 2008;
Shoham & Leyton-Brown, 2009). Several model-free RL methods have been proposed
and documented in the RL literature, including Q-learning and SARSA, but challenges in
learning with noisy or delayed rewards still exist (Powell, 2011; Sutton & Barto, 1998;
Szepesvari, 2010). Cognitive architectures of varying degrees of complexity, such as ACT-R,
SOAR, CLARION, and PMFServ, have been applied to this problem as well, though
not to the specific challenges of representing human decision makers in DoD analytic
simulation models (Langley et al., 2009; Sun, 2007b, 2006, 2007a; N. Taatgen et al., 2006;
Laird, 2008; Wray & Jones, 2006). The most promising of the current approaches, in
terms of satisfying the characteristics described above, are model-free RL methods and
information-processing-based models of cognition that leverage domain knowledge, such as
cognitive architectures.

A. REINFORCEMENT LEARNING 

Reinforcement learning is learning what to do, how to map situations to actions,
so as to maximize a numerical reward signal. (Sutton & Barto, 1998)

This section provides a review of RL literature from the field of artificial intelligence 
to include a brief summary of literature justifying its use as a conceptual model for human 
behavior representation. The relevance of RL as a foundation of human behavior and the 
broad utility of the algorithmic approaches developed in the artificial intelligence community
in solving applied problems from a variety of fields, including dynamic programming
and operations research, are emphasized (D. Bertsekas & Tsitsiklis, 1996; Powell, 2011). 

RL within the artificial intelligence community was originally inspired by research
in animal psychology and Thorndike’s “Law of Effect,” describing the effect of rewards
on an animal’s likelihood of selecting actions, still widely regarded as a basic foundational
principle accounting for much of animal, including human, behavior (Lattal, 2010;
Thorndike, 1911).

Of several responses made to the same situation, those which are accompanied 
or closely followed by satisfaction to the animal will, other things being equal, 
be more firmly connected with the situation, so that, when it recurs, they will 
be more likely to recur; those which are accompanied or closely followed by 
discomfort to the animal will, other things being equal, have their connections 
with that situation weakened, so that, when it recurs, they will be less likely to 
occur. The greater the satisfaction or discomfort, the greater the strengthening 
or weakening of the bond. (Thorndike, 1911) 

Thorndike’s research on the Law of Effect influenced Skinner’s research in operant 
conditioning (Skinner, 1938). This research is widely documented and empirically derived, 
making it a feasible candidate for a conceptual model for human behavior representation in 
DoD simulation models. Acquisition of responses to stimuli by humans is still an active
area of research in the experimental psychology and neuroscience communities (Lattal,
2010; Nargeot & Simmers, 2011; Okouchi, 2009). The behavioral economics community,
particularly those who study decision-making under uncertainty, has also continued empirical
work in this area (Sundaram, 2005; Yi, Steyvers, & Lee, 2009). Duffy provides
a summary of recent empirical work in the area of agent-based computational economics
(Duffy, 2006). Several recent experiments have compared the results of human decision-making
experiments with the results produced by popular RL methods, providing support
to the notion that RL can provide insight into human behavior in certain decision situations
(Acuna & Schrater, 2010; Ishida, Sasaki, Sakaguchi, & Shimai, 2009; Nedic, Tomlin,
Holmes, Prentice, & Cohen, 2008; Steyvers, Lee, & Wagenmakers, 2009; Walsh & Anderson,
2010; Lee, Seo, & Jung, 2012).

Reinforcement learning contributes to the theory of planned behavior, which has
been empirically studied as an explanation for behavior adoption (Ajzen, 1991). The theory
of planned behavior places reinforcement learning in a social context by incorporating
societal norms, perceived behavioral control, and attitude into the reward signal associated
with a potential behavior.

Reinforcement learning also overlaps with naturalistic decision-making (NDM), 
particularly Klein’s recognition-primed decision (RPD) model (Klein, 1993, 2008).
This model, based on empirical observations of decision-makers in primarily military and
emergency services organizations, describes the pattern-matching performed by expert
decision-makers. Experts tend to recognize situations more readily than others, allowing
them to identify the appropriate action to take based on their experiences that led to
successful outcomes when previously in similar situations.

Most recently, studies utilizing brain imaging technologies have identified patterns 
of reinforcement at the neuronal-level within the brain (Lee et al., 2012). Taken as a whole, 
this body of work provides an empirically developed conceptual model of individual human 
behavior, a requirement for DoD simulation models, upon which to base our implementation
(DMSO, 2004).

Sutton and Barto provide a full chronological account of the impact of RL on the 
field of artificial intelligence (Sutton & Barto, 1998). Since the sequential decision problem 
arises in multiple applied settings, multiple fields developed techniques to address these issues,
with several disciplines arriving at similar solutions. Operations research developed
the Bellman equations, which led to the development of the field currently referred to as
dynamic programming, to address the optimal control problem (Powell, 2011). During the
1980s, the RL community and dynamic programming community connected with Watkins's
Q-learning algorithm (Powell, 2011; Sutton & Barto, 1998; Watkins & Dayan, 1992). A
third line of effort began under those who study control theory, motivated by physical 
operating processes in continuous-time, with continuous states and actions. This led to 
approximate dynamic programming, initially called heuristic dynamic programming. The
connections between this branch of research and RL were highlighted and enumerated by
Bertsekas, who reclassified RL as neuro-dynamic programming (D. Bertsekas & Tsitsiklis, 
1996; D. R Bertsekas, 1995; Powell, 2011). Powell attempts to unite neuro-dynamic programming,
RL, and dynamic programming as approximate dynamic programming (Powell,
2011). This research is primarily focused on the RL problem as defined by Kaelbling et 
al. and Sutton and Barto and the algorithms and techniques developed and applied in the 
artificial intelligence community (Kaelbling et al., 1996; Sutton & Barto, 1998). 

The exploration-exploitation dilemma refers to the need to balance the ratio of exploratory
actions and greedy actions in order to maximize long-term reward. Since the RL
agent must explore its environment to understand its reward structure, but is also tasked 
with maximizing its long-term reward, balancing the ratio between greedy and exploratory 
actions is of critical importance. 

The temporal credit assignment problem refers to the need to efficiently assign rewards
to state-action pairs when rewards are delayed, as in path finding tasks. Delayed
rewards make it difficult to determine which of the state-action pairs attempted during the 
episode contributed to the achievement of the goal (Sutton & Barto, 1998). 

Dominant solutions to this problem rely on hill-climbing techniques known as
temporal-differencing (TD) methods, which can be slow to learn, especially in environments
with noisy reward signals. To address this problem, a family of value functions that
incorporate a bias parameter in conjunction with standard TD methods tends to dominate the
literature. These techniques require the setting of three tunable parameters, making their
implementation challenging, and still often result in slow learning times. They also
do not explicitly take into account environment time, the time scale that controls the dynamics
of the environment in which the RL agent is operating, making them less sensitive
to changes in the environment that can affect performance.

In the remainder of this section, we provide additional background on temporal-differencing
(TD) methods, Monte Carlo (MC) methods, and temporal-differencing with a
bias parameter (TD(λ)) methods, and detailed coverage of four relevant model-free methods.


1. Temporal Differencing, Monte Carlo, and TD(λ) Methods

Approaches to model-free learning are broadly characterized as Monte Carlo (MC)
or temporal-differencing (TD) methods, with TD(λ) bridging the two approaches by incorporating
the notion of eligibility traces to address the credit assignment problem (Sutton
& Barto, 1998; Russell & Norvig, 2010; Powell, 2011). Model-free methods estimate the
value of a state, $V(s)$, or the value of a state-action pair, $Q(s, a)$ (the focus of our
discussion from here forward), without the need for an internal model of the environment’s
dynamics and its associated learning cost.

MC methods typically have to wait until the end of an episode and then update the
value of all state-action pairs, $(s, a)$, visited, while TD methods make updates as observations
occur using a noisy hill-climbing approach. It is still an open question as to which
technique, MC or TD, converges faster, though both have been shown to converge to true
estimates of the target value and the optimal policy under asymptotic conditions seldom
present in real-world applications (Sutton & Barto, 1998). MC methods sample returns
from completed episodic tasks to estimate the value of a state, $V(s)$, as described by Sutton
(Sutton & Barto, 1998). First-visit MC and every-visit MC are two well-documented
techniques, which recursively update the sample mean for each state using either first-visit
or every-visit updating following the end of an episode, with $\alpha_i = 1/i$, where $i$ is the count
of the visits to a state,

$$V(s) = V(s') + \alpha_i \left( r - V(s') \right). \tag{9}$$

First-visit MC updates this general equation using information from only the first
visit to each state during an episode, while every-visit MC uses information available from
each visit during an episode with discounting incorporated most recently by Szepesvari,
see Algorithm 11 and Algorithm 12 in Appendix A (Szepesvari, 2010).
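
The recursive sample-mean update can be sketched in Python as follows. This is a minimal illustration and not the listing from Appendix A: it assumes the step size is 1/i, reads the primed value in the update as the previous estimate for the visited state, and uses a hypothetical episode layout of (state, reward) pairs. The name `first_visit_mc` is introduced here for illustration only.

```python
from collections import defaultdict


def first_visit_mc(episodes, gamma=1.0):
    """Estimate V(s) from completed episodes with first-visit Monte Carlo.

    episodes: list of episodes; each episode is a list of (state, reward)
    pairs, where reward is received on leaving that state (assumed layout).
    """
    V = defaultdict(float)      # running sample-mean estimates
    visits = defaultdict(int)   # i, the number of counted visits per state
    for episode in episodes:
        # Discounted return following each step, computed back to front.
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()
        seen = set()
        for state, G in returns:
            if state in seen:   # first-visit: ignore repeat visits
                continue
            seen.add(state)
            visits[state] += 1
            alpha = 1.0 / visits[state]          # alpha_i = 1/i
            V[state] += alpha * (G - V[state])   # recursive sample mean
    return dict(V)
```

Dropping the `seen` check would give the every-visit variant, in which each occurrence of a state within an episode contributes a return.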

The ability of MC methods to learn directly by sampling from the environment makes them
attractive, but it can result in longer learning times than TD algorithms. However, the conceptualization
of sequential decision problems as a series of bandit problems has led to their use in
planning algorithms, such as Upper Confidence bounds applied to Trees (UCT) (Kocsis & Szepesvari,
2006; Sutton & Barto, 1998). The relative simplicity of these techniques makes them attractive
for a variety of applications, though they still make use of a hill-climbing process in
the update (Bouzy & Chaslot, 2006; Valgaeren, Croonenborghs, & Colleman, 2009; Szita,
Chaslot, & Spronck, 2010; Asmuth & Littman, 2011). TD RL methods combine MC methods
and those from dynamic programming to estimate the value of a state (Sutton & Barto,
1998; Kaelbling et al., 1996). Methods that make use of eligibility traces, such as TD(λ),
further combine TD and MC methods, resulting in a class of methods with improved learning
rate and less bias than pure TD methods according to the literature (Sutton & Barto,
1998; Szepesvari, 2010; Powell, 2011). Note that the methods discussed so far are all typically
applied to learn the value of a state rather than a state-action pair. An eligibility trace,
as defined by Sutton, tracks the eligibility of a state, or state-action pair, to receive credit
for future rewards. These techniques require the use of a third parameter, $\lambda$, in addition to
the discount rate, $\gamma$, and the learning rate, $\alpha$.

TD methods update the value of each state-action pair, Q(s, a), using the most 
recent reward, r, and the estimated value of the next state-action pair, Q(s', a'), rather than 
waiting until the end of the episode (Sutton & Barto, 1998). TD(λ) combines elements of
MC and TD methods into a single framework to estimate the value of each state, V(s), 
through the use of eligibility traces, see Algorithm 13 in Appendix A (Sutton & Barto, 
1998; Szepesvari, 2010). 

In order to apply TD techniques to determine the value of a state-action pair we rely
on algorithms such as Q-learning and SARSA, two dominant model-free algorithms from
this class. In order to apply TD(λ) techniques to estimate the value of a state-action pair,
we rely on the extensions of these algorithms which incorporate the notion of eligibility
traces, Q(λ) and SARSA(λ). We provide a brief review of these four TD methods in the
next section.


a. Q-learning and Q(λ)

Q-learning is an off-policy TD approach, often used in RL applications, that requires
no internal model of the environment. Let

$$\delta = r' + \gamma \max_{a} Q(s', a) - Q(s, a),$$

$$Q(s, a) \leftarrow Q(s, a) + \alpha \delta, \tag{10}$$

where $\gamma$ is a discount factor and $\alpha$ is referred to as a learning rate or step size parameter,
see Algorithm 1.

Q-learning approximates the optimal action-value function, $Q^*$, regardless
of the policy, $\pi$, as long as each state-action pair can be visited an infinite number of
times (Watkins & Dayan, 1992). This relies on the Robbins-Monro conditions, regarding
the size and sequence of the learning rate parameter, $\alpha$, stated here: $\sum_{k=1}^{\infty} \alpha_k = \infty$ and
$\sum_{k=1}^{\infty} \alpha_k^2 < \infty$. Convergence to the optimal policy, $\pi^*$, is guaranteed for the sample
average case, where the learning rate $\alpha_k = 1/k$, but not for a constant learning rate, $\alpha$.
Q-learning is known to systematically overestimate the value of the optimal action in a given
state, $Q^*$, but remains an important algorithm due to its simplicity and effectiveness: in
applications where precise estimates are not as important as simply identifying the optimal
policy, $\pi^*$, performance is not impacted by this overestimation (Thrun & Schwartz, 1993).
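
As a brief worked check of the Robbins-Monro conditions named above (an illustration added here, not part of the cited proofs), the sample-average schedule satisfies both the divergent-sum and the finite-sum-of-squares requirements, while a constant step size fails the second:

```latex
\alpha_k = \tfrac{1}{k}: \quad
  \sum_{k=1}^{\infty} \frac{1}{k} = \infty, \qquad
  \sum_{k=1}^{\infty} \frac{1}{k^{2}} = \frac{\pi^{2}}{6} < \infty;
\qquad
\alpha_k = \alpha > 0: \quad
  \sum_{k=1}^{\infty} \alpha = \infty, \qquad
  \sum_{k=1}^{\infty} \alpha^{2} = \infty .
```

This is why a fixed learning rate trades the asymptotic guarantee for the ability to keep tracking a changing environment.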

The convergence rate of Q-learning is very dependent on the sequence of $\alpha$.
A typical modification to ensure convergence is to gradually decay the learning rate as a
function of time or samples. Note that Sutton claims that the lack of convergence
is a benefit in non-stationary environments and that a fixed, but properly tuned, constant
value of $\alpha$ is required for practical applied work (D. R Bertsekas, 1995; Sutton & Barto,
1998).

Kaelbling notes that although the algorithm seems to be the most effective
for delayed reward domains, it does not address issues regarding large state spaces, and
convergence time can be long (Kaelbling et al., 1996; Szepesvari, 2010). Though one of the
first model-free algorithms proposed, Q-learning’s importance as a benchmark algorithm,
and its utility in application, are demonstrated through its continued application and its
many extensions (Powell, 2011; Sutton & Barto, 1998; Szepesvari, 2010). As late as 2011,
Yu and Bertsekas proposed a new version of Q-learning intended for use in the stochastic
shortest path problem with reduced computational cost due to the incorporation of a
solution of the optimal stopping problem (Yu & Bertsekas, 2011).

Algorithm 1 Q-learning

1: Parameters: $s$, the current state; $Q(s, a)$, current value estimate of a state-action pair; $\gamma$,
   discount rate; $\alpha$, learning rate.

2: Initialize $Q(s, a)$ arbitrarily.

3: Initialize $s$.

4: Return $a$ using $\pi(s)$ (i.e., ε-greedy, $\beta$).

5: for $t$ in $T$ do

6:     Take action $a$, observe $r'$, $s'$.

7:     Return $a'$ using $\pi(s')$ (i.e., ε-greedy, $\beta$).

8:     $\delta = r' + \gamma \max_{a} Q(s', a) - Q(s, a)$

9:     $Q(s, a) \leftarrow Q(s, a) + \alpha \delta$

10: end for

11: Increment t. If t ^ T, go to line 3. 
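
A runnable sketch of the Q-learning loop is given below in Python. It is a minimal illustration under stated assumptions, not the implementation used in this research: the environment interface `step(s, a) -> (reward, next_state, done)`, the terminal-state handling, the random tie-breaking in the ε-greedy policy, and the four-state corridor `chain_step` are all hypothetical and introduced only for this example.

```python
import random
from collections import defaultdict


def epsilon_greedy(Q, s, actions, epsilon, rng):
    """Explore uniformly with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    best = max(Q[(s, a)] for a in actions)
    # Break ties at random so an all-zero table does not freeze the agent.
    return rng.choice([a for a in actions if Q[(s, a)] == best])


def q_learning(step, start, actions, episodes=200, horizon=50,
               alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning; step(s, a) returns (reward, next_state, done)."""
    rng = random.Random(seed)
    Q = defaultdict(float)                  # Q(s, a), initialized to zero
    for _ in range(episodes):
        s = start
        for _ in range(horizon):
            a = epsilon_greedy(Q, s, actions, epsilon, rng)
            r, s_next, done = step(s, a)
            # Off-policy target: best action value in the next state.
            best_next = 0.0 if done else max(Q[(s_next, b)] for b in actions)
            delta = r + gamma * best_next - Q[(s, a)]
            Q[(s, a)] += alpha * delta
            if done:
                break
            s = s_next
    return Q


def chain_step(s, a):
    """Hypothetical corridor with states 0..3; reaching 3 pays reward 1."""
    s_next = min(s + 1, 3) if a == "right" else max(s - 1, 0)
    done = s_next == 3
    return (1.0 if done else 0.0), s_next, done
```

Calling `q_learning(chain_step, 0, ["left", "right"])` returns a table in which moving right is valued above moving left in every non-terminal state. The sketch uses a constant learning rate, which matches the practical recommendation discussed above rather than the decaying schedule required for the convergence guarantee.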


Q(λ) combines Q-learning with eligibility traces, with alternative versions
from Peng and Watkins. Focusing on Watkins's version, the general idea is to reset the eligibility
trace, $e(s, a)$, for each exploratory action, which can result in poor performance
when the ratio of exploratory actions is large (Sutton & Barto, 1998). The eligibility trace,
$e(s, a)$, for the greedy case is incremented by one for the chosen action as described previously,
but the value is only decayed by $\lambda$ if the action chosen is the optimal greedy action.

Algorithm 2 Q(λ)

1: Parameters: $s$, the current state; $Q(s, a)$, current value estimate of a state-action pair;
   $e(s, a)$, current eligibility to receive credit of current state-action pair; $\gamma$, discount rate;
   $\alpha$, learning rate; $\lambda$, decay rate.

2: Initialize $Q(s, a)$ arbitrarily.

3: Initialize $s$.

4: Return $a$ using $\pi(s)$ (i.e., ε-greedy, $\beta$).

5: for $t$ in $T$ do

6:     Take action $a$, observe $r'$, $s'$.

7:     Return $a'$ using $\pi(s')$ (i.e., ε-greedy, $\beta$).

8:     $\delta = r' + \gamma \max_{a} Q(s', a) - Q(s, a)$

9:     $e(s, a) \leftarrow e(s, a) + 1$

10:     for all $(s, a)$ do

11:         $Q(s, a) \leftarrow Q(s, a) + \alpha \delta e(s, a)$

12:         if $a = a^*$ then
                $e(s, a) \leftarrow \gamma \lambda e(s, a)$

13:         else
                $e(s, a) \leftarrow 0$

14:         end if

15:     end for

16: end for

17: Increment t. If t ^ T, go to line 3. 


Otherwise the eligibility trace is set to zero, $e(s, a) \leftarrow 0$, prior to the next decision cycle, see
Algorithm 2. 
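
One decision cycle of the trace-resetting scheme can be sketched in Python as follows. This is an illustrative reading rather than the author's code: it assumes the greedy test is applied to the action already chosen for the next state (the convention in Sutton and Barto), and the function name and dictionary-based tables are hypothetical.

```python
from collections import defaultdict


def watkins_q_lambda_step(Q, e, s, a, r, s_next, a_next, actions,
                          alpha=0.1, gamma=0.9, lam=0.8):
    """Apply one Watkins-style Q(lambda) update in place and return delta.

    Q and e are dicts keyed by (state, action); a_next is the action the
    behavior policy has already selected for s_next.
    """
    best_next = max(Q[(s_next, b)] for b in actions)
    delta = r + gamma * best_next - Q[(s, a)]
    e[(s, a)] += 1.0                     # accumulate trace for the visited pair
    greedy = Q[(s_next, a_next)] == best_next
    for key in list(e):
        Q[key] += alpha * delta * e[key]
        # Decay traces after a greedy choice; cut them after exploration.
        e[key] = gamma * lam * e[key] if greedy else 0.0
    return delta
```

With both tables created as `defaultdict(float)`, a greedy step leaves the visited pair with a trace of `gamma * lam`, while an exploratory step zeroes every trace, which is the behavior that can hurt performance when exploration is frequent.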

b. SARSA and SARSA(λ)

SARSA (State-Action-Reward-State-Action) is an on-policy TD method 
distinguished from Q-learning by the timing of the backup. Q-learning backs up the 
best estimate of the value of a state-action pair, Q(s, a), during each observation period, 
while SARSA backs up the value of each state-action pair, $Q(s, a)$, following the execution
of each action (Powell, 2011). SARSA is generally considered less flexible than
Q-learning (Russell & Norvig, 2010). SARSA shares the same limitations in regard to
the choice of learning rate parameter, $\alpha$, as described above for Q-learning. Here let

Algorithm 3 SARSA

1: Parameters: $s$, the current state; $Q(s, a)$, current value estimate of a state-action pair; $\gamma$,
   discount rate; $\alpha$, learning rate.

2: Initialize $Q(s, a)$ arbitrarily.

3: Initialize $s$.

4: Return $a$ using $\pi(s)$ (i.e., ε-greedy, $\beta$).

5: for $t$ in $T$ do

6:     Take action $a_t$, observe $r_{t+1}$, $s_{t+1}$.

7:     Return $a_{t+1}$ using $\pi(s_{t+1})$ (i.e., ε-greedy, $\beta$).

8:     $\delta = r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$

9:     $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \delta$

10: end for

11: Increment t. If t ^ T, go to line 3. 


$$\delta = r' + \gamma Q(s', a') - Q(s, a),$$

$$Q(s, a) \leftarrow Q(s, a) + \alpha \delta, \tag{11}$$

and we see that the difference between Q-learning and SARSA is in the $\delta$ term only.
SARSA assumes the agent is in state $s$, chooses action $a$ according to the policy being
followed, $\pi(s)$, observes a reward, $r$, and the next state, $s'$, prior to choosing a second
action $a'$ based on the same policy, see Algorithm 3. Bertsekas equates SARSA with a
variant of optimistic policy iteration with no particular convergence guarantees and Powell 
describes SARSA as oscillating about the true estimate in most cases with no guarantees 
on error bounds (D. Bertsekas & Tsitsiklis, 1996; Powell, 2011). 
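
The difference between the two temporal-difference errors can be made concrete with a short Python sketch. The function names and the small value table used in the usage note are hypothetical and chosen only to show that the two methods share the same update and differ in the bootstrap term.

```python
def q_learning_delta(Q, s, a, r, s_next, actions, gamma=0.9):
    """Off-policy TD error: bootstrap from the best action in s_next."""
    return r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)]


def sarsa_delta(Q, s, a, r, s_next, a_next, gamma=0.9):
    """On-policy TD error: bootstrap from the action actually chosen in s_next."""
    return r + gamma * Q[(s_next, a_next)] - Q[(s, a)]


def td_update(Q, s, a, delta, alpha=0.1):
    """Shared update step: move Q(s, a) toward the target by alpha * delta."""
    Q[(s, a)] += alpha * delta
    return Q[(s, a)]
```

For a table with `Q[(1, "l")] = 1.0` and `Q[(1, "r")] = 2.0`, a reward of 0.5, and an exploratory next action `"l"`, the Q-learning error is about 2.3 while the SARSA error is about 1.4, because SARSA evaluates the action the policy will actually take.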

SARSA uses the same policy to decide which action to evaluate and to
choose the next action one step in the future. Powell refers to the policy that determines the
action to take as a behavior policy in physical systems and a sampling policy in a simulation
setting. The policy that determines which action is best is known in RL as the target policy;
Powell refers to it as the learning policy. SARSA combines learning and policy sampling, resulting
in on-policy learning, while Q-learning employs different learning and sampling policies
(Powell, 2011).

SARSA(λ) uses eligibility traces, $e(s, a)$, as we saw in Q(λ), but it does not
reset traces on exploratory actions. This leads to reports of better empirical performance
than Q(λ), see Algorithm 4. All the estimators we have discussed must be paired with a
policy that balances the exploration and exploitation trade-off. There are no convergence
guarantees associated with SARSA(λ) or Q(λ) (Sutton & Barto, 1998).

Algorithm 4 SARSA(λ)

1: Parameters: $s$, the current state; $Q(s, a)$, current value estimate of a state-action pair;
   $e(s, a)$, current eligibility to receive credit of current state-action pair; $\gamma$, discount rate;
   $\alpha$, learning rate; $\lambda$, decay rate.

2: Initialize $Q(s, a)$ arbitrarily, $e(s, a) = 0$, $\forall (s, a)$.

3: Initialize $s$.

4: Return $a$ using $\pi(s)$ (i.e., ε-greedy, $\beta$).

5: for $t$ in $T$ do

6:     Take action $a$, observe $r'$, $s'$.

7:     Return $a'$ using $\pi(s')$ (i.e., ε-greedy, $\beta$).

8:     $\delta = r' + \gamma Q(s', a') - Q(s, a)$

9:     $e(s, a) \leftarrow e(s, a) + 1$

10:     for all $(s, a)$ do

11:         $Q(s, a) \leftarrow Q(s, a) + \alpha \delta e(s, a)$

12:         $e(s, a) \leftarrow \gamma \lambda e(s, a)$

13:     end for

14: end for

15: Increment t. If t ^ T, go to line 3. 
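
The inner update of SARSA(λ) can be sketched in Python as below. This is a minimal illustration assuming accumulating traces stored in dictionaries keyed by (state, action); the function name is hypothetical and the sketch omits the surrounding action-selection loop.

```python
from collections import defaultdict


def sarsa_lambda_step(Q, e, s, a, r, s_next, a_next,
                      alpha=0.1, gamma=0.9, lam=0.8):
    """Apply one SARSA(lambda) update in place and return the TD error."""
    delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    e[(s, a)] += 1.0                      # accumulate trace for the visited pair
    for key in list(e):
        Q[key] += alpha * delta * e[key]  # credit every eligible pair
        e[key] *= gamma * lam             # traces always decay, never reset
    return delta
```

Applied to two consecutive steps where only the second yields a reward of 1, the first pair still receives credit of `alpha * gamma * lam`, which illustrates how traces pass delayed reward back along the trajectory.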


The four RL algorithms reviewed here, Q-learning, Q(λ), SARSA, and
SARSA(λ), provide a representative sample of the dominant methods employed in the
model-free estimation of the value of a state-action pair in practice. They all make use of
a noisy hill-climbing approach, which can result in slowed learning in noisy or delayed
reward environments. The use of a bias parameter, $\lambda$, to control bias through the use of
eligibility traces adds another parameter that must be calibrated in addition to the discount
rate and the learning rate. These methods also do not make use of continuous-time information
from the environment within which the RL agent is operating, instead simply relying
on an action counter. This raises the question of whether a more straightforward approach
that takes advantage of the historical average reward and makes better use of environment
time could result in faster learning, particularly in environments with noisy and delayed
rewards.

2. Exploration and Exploitation

The exploration-exploitation trade-off presents itself in many applied sequential 
decision making applications (sequential sampling, control theory, dynamic programming) 
(Robbins, 1952; D. Bertsekas & Tsitsiklis, 1996; Sutton & Barto, 1998; Powell, 2011). 
The exploration-exploitation dilemma in RL refers to the challenge of knowing when to 
choose an action that does not adhere to the optimal, or greedy, policy. If an agent placed 
in an unknown MDP adheres to a stationary policy that maximizes its long-term reward 
it runs the risk of trying one action, receiving a single reward and then cycling on that 
action since it will be the only action associated with a reward (Powell, 2011; Sutton & 
Barto, 1998). An agent following such a strategy could miss out on potentially greater 
reward to be achieved by attempting previously unexplored actions. This need to explore 
the state-space is a requirement for RL systems to function effectively. Robbins described
this problem as a sequential sampling problem, and it is familiar to those interested in control
theory as well as dynamic programming (Robbins, 1952).

The policy selected to address this issue impacts the quality of the estimate provided
by the value function since it controls sampling. Stochastic policies, such as ε-greedy and
Boltzmann exploration, referred to later in this research for brevity as $\beta$, guarantee the
exploration of the full state-action space as $n \to \infty$, with variations that allow the ratio to
respond dynamically to the environment (Sykulski, Adams, & Jennings, 2010; Tran-Thanh,
Chapman, Munoz De Cote Flores Luna, Rogers, & Jennings, 2010; Tokic, 2010; Nouris,
2010).

Interval-based techniques such as Upper Confidence Bound (UCB) base the level
of exploration on the uncertainty of the estimate (Auer & Ortner, 2010; Powell, 2011).
ε-greedy specifies the probability, $\epsilon$, of taking an exploratory action explicitly, making it
intuitive to most practitioners. Boltzmann exploration uses the estimated value of each
state-action pair, $Q(s, a)$, and a scaling parameter, $\tau$, to develop a distribution over the
actions available from a given state, $a \in \pi(s)$. In contrast, the general idea of interval
estimation is that those estimates with larger intervals will be sampled more often, ensuring
that as $n \to \infty$, $\sigma \to 0$ for a stationary environment (Powell, 2011; Auer & Ortner, 2010;
Szepesvari, 2010). Techniques for decaying the ratio of exploration and exploitation, $\epsilon$ and
$\tau$, are discussed in the literature to ensure greedy behavior as $n \to \infty$, but for practical
RL problems in non-stationary environments Sutton and Barto recommend the use of an
appropriately “tuned” fixed ratio to avoid the inability to adapt to a changing environment
(Powell, 2011; Sutton & Barto, 1998).

a. ε-Greedy

It is common to see a mixed strategy employed in the literature. One of the
simplest of these techniques is ε-greedy, where an exploration rate, $\epsilon$, specifying the probability
of selecting a non-greedy action is set in advance.

This strategy is popular for its simplicity and effectiveness (Powell, 2011;
Russell & Norvig, 2010; Sutton & Barto, 1998). These methods choose a suboptimal step
at random throughout the course of the learning period with the specified probability $\epsilon$.
This guarantees that, as the number of action selection opportunities goes to infinity,
each action will be sampled an infinite number of times, satisfying conditions for convergence
for some classes of value functions, but with no guarantees in the applied case since
infinity is a long time (Russell & Norvig, 2010; Sutton & Barto, 1998).
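
A minimal Python sketch of ε-greedy action selection follows. It assumes the common variant in which the exploratory draw is uniform over all actions, including the greedy one, so the chance of a non-greedy action is slightly below ε; the function name and the dictionary layout are hypothetical.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action from q_values, a dict mapping action -> estimated value.

    With probability epsilon an action is drawn uniformly at random
    (exploration); otherwise the highest-valued action is returned.
    """
    actions = list(q_values)
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=q_values.get)
```

Passing a seeded `random.Random` instance as `rng` makes a simulation run repeatable, which is useful when comparing learning curves across estimators.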

Several derivatives of this approach exist in the literature, such as ε-first and
ε-decreasing. ε-first simply places the fraction of the episode, of length $T$, that is intended
to be exploratory at the beginning of the episode. So, the policy is dependent on the current
step, with a random action chosen prior to step $\epsilon T$, after which the greedy policy is followed. The
general idea with ε-decreasing is to allow $\epsilon$ to decrease with the number of iterations.
Versions of this policy exist that leverage the number of steps in an episode and the number
of visits to a state in order to adjust the policy. Let $N^n(s)$ be the number of times an agent
has visited state $s$ by iteration $n$.


$$\epsilon^n(s) = \frac{c}{N^n(s)}, \tag{12}$$

where $0 < c < 1$ is our initialization constant. So an exploratory action is chosen with
probability $\epsilon^n(s)$, the exploratory policy chooses $a$ with probability $1/|A|$, and the probability
of the policy selecting action $a$ in state $s$, $P^n(s, a)$, is at least $\epsilon^n(s)/|A|$. The result
of this approach is that each state will still be visited infinitely often even as the level of
exploration decreases since (Powell, 2011),

$$\sum_{n=1}^{\infty} P^n(s, a) \geq \sum_{n=1}^{\infty} \frac{\epsilon^n(s)}{|A|} = \infty. \tag{13}$$
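
The visit-count schedule can be sketched in Python as follows. This is an illustration under assumptions: the class name is hypothetical, and the choice to explore with probability 1 before a state has been visited (where the ratio c/N is undefined) is an assumption made here, not something stated in the source.

```python
import random
from collections import defaultdict


class DecreasingEpsilon:
    """Epsilon-decreasing policy with epsilon(s) = c / N(s)."""

    def __init__(self, c=0.5, seed=0):
        if not 0 < c < 1:
            raise ValueError("c must lie strictly between 0 and 1")
        self.c = c
        self.visits = defaultdict(int)   # N(s), visits to each state so far
        self.rng = random.Random(seed)

    def epsilon(self, state):
        n = self.visits[state]
        # Assumption: explore fully until the state has been seen once.
        return 1.0 if n == 0 else self.c / n

    def select(self, state, q_values):
        """q_values maps action -> estimated value for this state."""
        eps = self.epsilon(state)
        self.visits[state] += 1
        if self.rng.random() < eps:
            return self.rng.choice(list(q_values))
        return max(q_values, key=q_values.get)
```

Because the exploration probability shrinks like a harmonic series in the visit count, its sum over visits still diverges, which is the property the infinite-visitation argument relies on.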

Multiple variations of this approach are found in the literature that employ
similar strategies to address the challenge of adjusting the ratio of exploration and exploitation
(Powell, 2011; Sato & Kobayashi, 2000; Sorg, Singh, & Lewis, 2010; Sutton & Barto,
1998; Tran-Thanh et al., 2010). ε-least-taken, for example,


e = 1 -min(e“ t , 1), 


(14) 


where, 


4 


e n )t 4 + n 2> 


(15) 


and $n$ is the number of times action $a$ was selected prior to trial $t$ (Sato & Kobayashi, 2000).
One potential disadvantage of ε-based approaches is that the random actions are selected with no
regard to the information that has been obtained through previous sampling (Kaelbling et
al., 1996).

Value difference based exploration (VDBE) is another recent approach,

$$\frac{1 - e^{-\left| Q_{t+1}(s, a) - Q_t(s, a) \right| / \sigma}}{1 + e^{-\left| Q_{t+1}(s, a) - Q_t(s, a) \right| / \sigma}}, \tag{16}$$


where $\sigma$ is a positive constant called the inverse sensitivity that must be tuned for each
environment and task (Tokic, 2010). Note, however, that the fraction of the time to explore
is reported as more intuitive to set than the scaling parameter, the temperature, employed by
Boltzmann exploration techniques, which make use of the information available regarding
the value of state-action pairs to inform the level of exploration and exploitation (Sutton &
Barto, 1998).
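
The value-difference term used by VDBE can be computed as in the Python sketch below. The blending rule in `vdbe_epsilon`, which mixes the new term into the previous state-dependent exploration rate with a weight `delta`, is recalled from Tokic (2010) and does not appear in the text above, so it should be treated as an assumption; both function names are hypothetical.

```python
import math


def vdbe_value_difference(q_old, q_new, sigma):
    """Map the absolute change in Q(s, a) to [0, 1); sigma is the inverse sensitivity."""
    x = math.exp(-abs(q_new - q_old) / sigma)
    return (1.0 - x) / (1.0 + x)


def vdbe_epsilon(eps_old, q_old, q_new, sigma, delta):
    """Blend the value-difference term into the previous exploration rate.

    delta in (0, 1] weights the new information (assumed update rule).
    """
    f = vdbe_value_difference(q_old, q_new, sigma)
    return delta * f + (1.0 - delta) * eps_old
```

A large change in the value estimate pushes the exploration rate toward 1, while a stable estimate lets it decay toward 0, so exploration is driven by how much the agent is still learning in that state.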


b. Boltzmann Exploration 

Rather than simply choose an action at random, Boltzmann exploration, alternatively
referred to as a softmax action selection rule, makes use of the estimated value
of the action, making the probability of choosing an action increase with its
estimated value (Powell, 2011; Sutton & Barto, 1998). In our use, we adapt Boltzmann
exploration to overcome numerical precision error as the estimated value of each state-action
pair divided by the scaling parameter goes to infinity, $Q(s, a)/\tau \to \infty$, an issue also
identified by Hasselt (Hasselt, 2010).

Here, $\tau$ is a parameter known as the temperature. This parameter serves as a scaling
parameter, making the probability of selecting a greedy action go toward 1 as $\tau \to 0$ and
producing a more exploratory sampling policy with larger values of $\tau$. As with the ε-greedy
techniques, the level of exploration is often decayed over the course of the episode. Similar
strategies are employed to achieve this effect.
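
One common way to avoid overflow in softmax action selection is to subtract the largest value before exponentiating, as in the Python sketch below. This is offered as an illustration of the numerical issue only; it is not claimed to be the adaptation used in this research, and the function names are hypothetical.

```python
import math
import random


def boltzmann_probabilities(q_values, tau):
    """Softmax over a list of action values with temperature tau > 0.

    Subtracting max(q_values) leaves the probabilities unchanged but keeps
    every exponent at or below zero, so exp() cannot overflow.
    """
    m = max(q_values)
    weights = [math.exp((q - m) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def boltzmann_select(q_values, tau, rng=random):
    """Return the index of an action sampled from the Boltzmann distribution."""
    probs = boltzmann_probabilities(q_values, tau)
    return rng.choices(range(len(probs)), weights=probs)[0]
```

With a small temperature the distribution concentrates on the greedy action, and with a large temperature it approaches uniform, matching the role of the temperature described above.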

In the following section, we will discuss two techniques that leverage information
regarding the statistical quality of the estimate to determine the appropriate
exploration policy: interval estimation and upper confidence bound. For a detailed comparison
of Boltzmann and ε-greedy exploration that discusses when the techniques should be
equivalent, see Appendix B.

c. Interval Estimation and Upper Confidence Bound 

Methods from statistical sampling and experimental design have been applied
successfully to this area. Interval-based techniques, such as interval estimation, use
measures of the uncertainty associated with an estimate, such as the variance or standard
deviation, to determine the exploration policy (Kaelbling et al., 1996; Powell, 2011). The
interval estimation algorithm, attributed to Kaelbling, takes this approach. The algorithm
treats the action selection problem by maintaining an estimate of the expected value of an
action as well as the standard deviation. The policy is,

$$\pi(s) = \arg\max_{a} \left( Q(s, a) + z_{\alpha} \sigma_{Q} \right) \tag{18}$$

where $\sigma_{Q}$ is the estimate of the standard deviation of $Q(s, a)$ and $z_{\alpha}$ is a parameter that
determines the size of the interval. The general idea is that as the number of observations
of a pair $(s, a)$ goes to infinity, the standard deviation goes to zero for a stationary environment
(Strehl & Littman, 2005, 2004). The upper confidence bound sampling algorithm (UCB)
takes a similar approach, using a different scaling factor to apply to the standard deviation.

$$\pi(s) = \arg\max_a \left\{ Q(s,a) + C^{\max} \sqrt{\frac{2 \ln n}{N(s,a)}} \right\} \tag{19}$$

where $C^{\max}$ is the maximum possible contribution (in practice, an estimate of the
95th percentile is used), $n$ is the total number of samples taken so far, and $N(s, a)$ is the
number of times action $a$ has been sampled in state $s$ (Powell, 2011).
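
The two selection rules can be sketched for a single state as follows. This is an illustration under stated assumptions rather than the formulation used in the cited works: the exploration bonus uses the common UCB1 form sqrt(2 ln n / N), untried actions are sampled first so that the bonus is defined, and the function names and the default interval width of 1.96 are hypothetical.

```python
import math


def interval_estimation_action(q, sigma, z_alpha=1.96):
    """Pick the action maximizing Q(s,a) + z_alpha * sigma(s,a) for one state.

    q and sigma are per-action lists of value estimates and standard deviations.
    """
    scores = [q_a + z_alpha * s_a for q_a, s_a in zip(q, sigma)]
    return max(range(len(scores)), key=scores.__getitem__)


def ucb_action(q, counts, c_max):
    """Pick the action maximizing Q(s,a) + c_max * sqrt(2 ln n / N(s,a)).

    counts holds N(s,a) for each action; n is the total number of samples.
    """
    # Assumption: sample any untried action first so that N(s,a) > 0 below.
    for action, n_a in enumerate(counts):
        if n_a == 0:
            return action
    n = sum(counts)
    scores = [
        q_a + c_max * math.sqrt(2.0 * math.log(n) / n_a)
        for q_a, n_a in zip(q, counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

In both rules a rarely sampled or highly uncertain action can be preferred over the action with the highest current estimate, and the preference fades as observations accumulate.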

d. Other Recent Approaches 

Peeters et al. describe a learning automata approach to dynamically adjust
the temperature on a Boltzmann function based on the action probabilities of multiple
agents (Peeters, Kononen, Verbeeck, & Nowe, 2008). Still and Precup provide an
information-theoretic approach to the exploration and exploitation problem, suggesting two
approaches that incorporate the prediction accuracy (Still & Precup, 2012). Scott approaches
the problem from a Bayesian viewpoint, using observation and MC methods to provide a
parameter-free method for determining when to explore and when to exploit.

Rather than adjust the policy based on the variance of the estimate, Moody
accounts for uncertainty by adjusting the value estimate itself (Moody & Saffell, 2001;
Moody, Liu, Saffell, & Youn, 2004). He penalizes the state-action value estimate using 
techniques based on Sharpe’s ratio, 
where the denominator represents the standard deviation of the long-term estimate of the
value of a given (s, a) at time t, and the Sterling ratio,
where the denominator represents the maximum negative change in the value of the long-term
estimate observed up to the current time (Moody & Saffell, 2001; Moody et al., 2004).
Using the Sterling ratio as a basis, he develops his downside deviation ratio (DDR), which
penalizes the value estimate for downside variation, rewarding large positive returns and
penalizing negative returns only, since his primary application for this technique is stock
trading, where upside variance is often desirable.
This approach accounted for uncertainty by biasing the estimator against
poorly performing state-action pairs, introducing estimation error for the value estimator,
but positively impacting overall performance (Moody & Saffell, 2001; Moody et al., 2004).

3. Measures of Learning Performance

The value estimate calculated by the agent for each state-action pair provides the
agent with a measure of goodness associated with that pair, and the accuracy
of this approximation provides a means of comparing the relative performance of
algorithms on similar tasks (Kaelbling et al., 1996). In addition to value approximation error,
further measures of the performance of algorithms and policies are eventual
convergence to the optimal policy, speed of convergence, and regret. These
measures and their practical usefulness are discussed below.

a. Convergence 

Several RL algorithms, such as Q-learning, have been shown to be provably optimal
in the limit (D. P. Bertsekas, 1995; Kaelbling et al., 1996; Powell, 2011; Watkins
& Dayan, 1992). Convergence in the limit does not provide practical benefit in most
real-world applications, and this measure, while relevant to the theoretical body of knowledge
of the community, does not provide a ready means of evaluating an algorithm’s performance.

b. Speed of Convergence 

This measure suffers from the same problem as convergence in the limit,
since guarantees of optimality often only arise in the limit. Kaelbling suggests a
metric such as the speed of convergence to near optimality, but concedes that this is also
ill-defined and that measures based purely on speed of convergence might lead to unintended
consequences in algorithm selection, such as incurring excessive penalties on the road to
optimality (Kaelbling et al., 1996). One practical option often used in the literature to
measure learning speed for sequential tasks is simply the mean number of times a goal state
is achieved in a set period.

c. Regret 

Regret provides a useful measure that captures the same information as
measures related to convergence, but in a meaningful metric. Regret is defined as the expected
loss of reward or utility due to a deviation from the optimal policy (Audibert, Munos, &
Szepesvari, 2007; Kaelbling et al., 1996; Kakade, Lobel, & Nazerzadeh, 2010; Kuleshov
& Precup, 2010; Blum & Monsour, 2007).

$$L_T = T \max_a r(a) - \sum_{t=1}^{T} R_t, \tag{23}$$

where $T$ is the total number of action selection opportunities in the episode, $r(a)$ is
the reward received for action $a$, and $R_t$ is the reward actually received at step $t$.
Kaelbling et al. cite a preference for this measure, but lament the fact that results
documenting the regret of RL algorithms are not often available,
though this has changed in more recent literature (Kaelbling et al., 1996; Blum & Monsour,
2007; Kocsis & Szepesvari, 2006; Tran-Thanh et al., 2010). Regret is used extensively as a
measure of performance of algorithms used to solve n-arm bandit problems. For a Bayesian
view of this problem, see Appendix C.
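
As a small worked illustration of the regret definition, the following sketch computes the cumulative regret of a reward sequence and its running value over time. It assumes the reward of the best action is known, which is generally only true in benchmark settings; the function names are hypothetical.

```python
def cumulative_regret(best_reward, rewards):
    """Regret after T steps: T * max_a r(a) minus the sum of rewards received."""
    return len(rewards) * best_reward - sum(rewards)


def regret_curve(best_reward, rewards):
    """Running regret after each step, useful for plotting learning progress."""
    curve = []
    total = 0.0
    for t, reward in enumerate(rewards, start=1):
        total += reward
        curve.append(t * best_reward - total)
    return curve
```

For example, if the best action pays 1.0 per step and the agent receives rewards 1, 0, 1, 1, the regret after four steps is 1.0, all of it incurred at the second step.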

d. Approximation Error 

The accuracy of the value function itself can also provide insight into an
algorithm’s performance, especially when measured in conjunction with regret. The
approximation error is the difference between the true value of a state $s$, $V(s)$, and the
approximation arrived at by the RL algorithm, $\hat{V}(s)$. This measure can be examined for
each $s$ or, as suggested by Powell, as an aggregate across all $s$ based on the frequency with
which $s$ is visited. Using his notation, where $\nu^{(1)}$ is the error associated with the first
algorithm to be compared and $p^{(1)}_s$ is the fraction of the time that state $s$ is visited
under algorithm one,

$$\nu^{(1)} = \sum_{s} p^{(1)}_s \left\lVert V(s) - \hat{V}^{(1)}(s) \right\rVert. \tag{24}$$

Apart from the visit-frequency weighting, this measure treats all states equally in the
evaluation of the algorithm’s performance, and it could also be applied to the estimates of
state-action pairs. Alternatively, based on the application, a weighting scheme might be
appropriate, tailoring error bounds based on the
state-action pair’s relevance to the achievement of the task goal (Powell, 2011).
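
The aggregate approximation error can be sketched as a visit-frequency-weighted sum of absolute errors. The dictionary-based representation and the function name are assumptions made for illustration; the true values would typically only be available in benchmark problems with known solutions.

```python
def weighted_value_error(true_values, estimated_values, visit_counts):
    """Sum over states of visit fraction times |V(s) - V_hat(s)|.

    All three arguments are dictionaries keyed by state.
    """
    total_visits = sum(visit_counts.values())
    if total_visits == 0:
        raise ValueError("at least one state visit is required")
    error = 0.0
    for state, count in visit_counts.items():
        visit_fraction = count / total_visits
        error += visit_fraction * abs(true_values[state] - estimated_values[state])
    return error
```

A state that is poorly estimated but rarely visited contributes little to this measure, which is the intent of weighting by visit frequency.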

e. Frequency of Optimal Action Selection

This measure provides an objective measure of performance based simply
on the ratio of the number of times the best action was selected for a given state to its
complement (Kuleshov & Precup, 2010). This metric can be shown over time or treated as an
episodic measure. It is independent of the reward structure and the discount parameter
often used in the estimation of the long-term expected reward of a state-action pair. During
a learning session one would expect this value to start out fairly low and
move toward 1 as the learning period progresses, but it is highly dependent on the choice of
stochastic policy and the ratio of exploration to exploitation. Exploration and exploitation
is just one consideration in placing RL into a more human-like setting. Cognitive
architectures provide a potential framework to allow the incorporation of RL methods into a model
of the human information processing system.

B. COGNITIVE ARCHITECTURES 

Cognitive architectures show great potential, but several limitations to their use in 
modeling and simulation applications currently exist. The first of these limitations is the 
lack of consistent benchmark environments and methodologies. This poses problems for 
validation efforts, suggesting that a context or use case-based approach might be more 
appropriate (Zacharias et al., 2008). A second limitation is the time required to develop 
and populate cognitive architectures. This relates to the third limitation, a limited ability 
to learn. New approaches to learning are required to overcome the brittleness problem
encountered when architectures reach the limits of their domain knowledge. Another
consideration, rather than limitation, is identifying the correct level of abstraction when
representing human cognition for each use case. This is important since the ability to represent
fine-grained detail within the information processing system exists, but the technology to
verify and validate these structures does not. Future work is required in the following
areas: tools to facilitate architecture development and instantiation, methodologies to
facilitate knowledge base development and enhance explainability of developed models,
methods for modeling groups, identifying the appropriate use of learning, and use-case based
validation (Zacharias et al., 2008). This section will provide an overview of the general
information-processing model of cognition, issues with cognitive modeling, and several of 
the more prominent cognitive architectures currently in use. 

1. Considerations in Cognitive Modeling 

Human cognition can be conceptualized using the information processing model 
formulation, which serves to enable experimentation and analysis as well as hypothesis 
generation (Wickens & Hollands, 2000). It is important to note that this model does not 
strive to identify where in the brain processes take place, but simply acknowledges that 
these processes must occur somewhere, leaving it to others to conduct empirical studies for 
the purpose of identifying which sections of the brain are involved with the different
processes required by the general information processing model (Anderson, 2005; Anderson
& Schunn, 2005). The model in Figure 5 is representative of the information processing
model as related by Wickens and generally accepted in the human factors and cognitive
science communities (Wickens & Hollands, 2000). Note that the software agent framework
from Russell and Norvig mirrors this functionality at a more abstract level. This should not
be surprising, since artificial intelligence seeks to develop more capable software agents by 
emulating and understanding human behavior (Russell & Norvig, 2010). 

Figure 5: Information processing model of human cognition.

Information from the environment received by the sensory receptors enters into
perception through sensory processing. Selective attention assists an observer in identifying
which cues from the environment are relevant for processing given the current task and
situation. Perception serves to provide meaning to the information received from the sensors,
a process sometimes referred to as sense-making.

Two important characteristics of perceptual processing are that it is generally thought 
of as automatic and that it is driven by both bottom-up processing of sensory inputs and 
top-down processing of information from long-term memory regarding the current set of 
percepts from the environment (or situation) and expectations based on past experience. 
Wickens distinguishes between perception and cognition based on the processing time
associated with each, stating:

cognitive operations generally require greater time, mental effort or attention. 

(Wickens & Hollands, 2000) 

Cognition typically involves operations that utilize working memory. Working
memory is characterized as a finite resource that is highly vulnerable to disruptions
in attention. Once information has been rehearsed above a threshold it moves into long-term
memory, to be retrieved as needed based on associations with other information from
the current perceived state. Note the similarity with RL, whose main objective is to learn
to map states to actions through a similar mechanism of association. Wickens provides 
only a brief treatment of response selection and execution, generally stating that an action 
from the potential actions is selected and in essence scheduled for performance just as in a 
discrete event system. The feedback loop is intended to represent two points (Wickens & 
Hollands, 2000): 

• Flow of information can begin at any point in the system. 

• The feedback loop is near continuous. 

This is also a brief allusion to perceptual control theory, a complementary view that
makes the case that humans are control systems that rely on perception to maintain control
of the system and that behavior can be managed by controlling one’s perception (Powers
& Treval, 1973). The attention component shows the role of attention in the process. The 
allocation of this finite resource drives the formation of perception and response selection, 
impacting the entire feedback loop. Attention interacts with long-term memory, making use 
of prior experience to allocate attention. This use of prior experience to learn what elements 
of the current perceived state are important again calls to mind concepts from reinforcement 
learning. The concept of divided attention impacts perception significantly. Operators 
attending to multiple tasks in an uncertain environment must divide their attention between 
cues supporting each of the tasks. This divided attention degrades their ability to correctly
attend to any one task. This is a prime example of the quantity versus quality trade-off in
task performance. It highlights the conceptual importance of attention as
a means of influencing performance and makes a strong argument against multitasking.

The human information processing framework serves as a source of inspiration for
placing realistic constraints on simulation-based representations of human cognition.
Taatgen and Anderson identify several constraints on cognitive modeling from the perspective
of trying to replicate human cognition (N. A. Taatgen & Anderson, 2008). These
constraints are proposed as considerations when developing cognitive architectures intended
to replicate human cognition and serve as useful check points and points of comparison 
between architectures. 

The first of these is working memory capacity, which serves the purpose of
maintaining a representation of the current task environment and whose limitation is supported
by Miller’s (1956) empirical work as referenced by Taatgen and Anderson. The authors
describe the working memory constraint as one on cognitive function as opposed to a simple
limiter on capacity. They further describe working memory as a relevance filter on
perception, ensuring that the cognitive system is focused on relevant information (N. A. Taatgen
& Anderson, 2008).

The second constraint on cognitive modeling is the serial bottleneck (N. A. Taatgen 
& Anderson, 2008). A debate exists regarding the presence of a bottleneck in central
processing in human cognition, centered around the allocation of attention to multiple streams
of processing. It is thought that capacity constraints exist within peripheral processes, but
that a more significant constraint exists in central cognition (Anderson, 2005; Anderson
& Schunn, 2005). A common example used to illustrate this is the fact that humans can
process multiple streams of perception and execute multiple simultaneous tasks, but cannot
think about two things simultaneously (Broadbent, 1958; Anderson, 2005; Anderson &
Schunn, 2005). Through practice, however, humans have been shown to reduce the need
for central cognition in select low-level cognitive tasks by developing a level of automaticity
that enables some level of parallel processing, which is thought to assist in overcoming
the constraint in central processing (Anderson, 2005; Anderson & Schunn, 2005).

Perception and motor system performance can be informed in great detail from 
empirical work, at least in regard to performance measures such as response time and signal 
detection, but the problem of informing learning from empirical work is considered more 
challenging. Taatgen and Anderson consider a model more constrained if it is required to 
learn its own knowledge, from either direct instruction or feedback, as opposed to having
the knowledge specified by the modeler. This is particularly relevant for RL, where the 
entire task is to learn from interaction with the environment. This highlights a common 
technique employed to speed RL, the incorporation of domain knowledge. Taatgen and 
Anderson also state that learning can occur through examination of the environment and 
the identification of patterns, very much in line with RL and Klein’s recognition-primed
decision making (N. A. Taatgen & Anderson, 2008; Klein, 1993).

Neuroscience provides two additional constraints that arise in discussion of cognitive
architectures: constraints at the level of individual brain cells and constraints at the
global brain architecture level. Two views of the impact of individual brain cells appear 
in the literature. The first view, reductionism, is that the characteristics of individual brain 
cells are not required to develop an understanding of human cognition (N. A. Taatgen & 
Anderson, 2008). The second view is that the neurons serve as important constraints on 
cognitive architectures, but this view raises the binding problem and the catastrophic
interference problem (N. A. Taatgen & Anderson, 2008). The binding problem is how cognitive
systems group features, or variables, together. From an RL point of view, this speaks to the
learning of the values of state-action pairs. The catastrophic interference problem refers to 
the fact that previously learned knowledge can be unlearned. 

Cognitive architectures can now be informed by the results of work from
neuroscience that identifies regions of the brain associated with certain functions. One of the
most interesting open questions regarding cognitive architectures is the level of detail
necessary to achieve generally intelligent agents (Laird & Wray III, 2010; Laird, 2008). From
a practical applied standpoint, this question is really what level of detail is required for the
use case, since, as Box says, “All models are wrong, some are useful.”

2. Recognition-Primed Decision Making and Goals, Operators, Methods, 
Selectors 

Recognition-primed decision making (RPD) developed out of the study of
naturalistic decision making (NDM). NDM researchers seek to understand how individuals build
expertise and apply it to decision making. In this paradigm, expertise is defined by explicit
knowledge, in the form of facts and rules, and the more valuable tacit knowledge, such as
the ability to recognize patterns, make subjective judgements, and make use of mental
models (see Figure 6). Explicit knowledge is well represented by first-order-logic-type rules.
Tacit knowledge is gained over time through experience with a given situation or state of the
world. Recognition of a given state that has been experienced before and an understanding of
what actions are appropriate in that state are at the heart of RPD. The recognition-primed
decision model describes how individuals match patterns to recognize a situation and, depending
on their expertise with that situation, either leverage that experience to select an action or,
with less positive experience in a given situation, mentally simulate potential
outcomes of action alternatives (Klein, 1993, 2008). So, from this viewpoint, RL would
be used to build tacit knowledge, and as the RL algorithm converged this would lead to a
transformation of that tacit knowledge into explicit knowledge.


Figure 6: Recognition primed decision making (Klein, 1993). 


Goals, Operators, Methods, and Selection (GOMS) task analysis, attributed to Card, Moran,
and Newell, provides a useful framework for analyzing and representing goal-directed behavior
(John & Kieras, 1996; Olson & Olson, 1990; Oyewole & Haight, 2011; Craig et al., 2012).
GOMS-derived first-order logic rules are a form of cognitive task analysis that reduces
human interaction with a computer down to first principles. We can also use GOMS
frameworks to drive autonomous behavior. In this sense, the goal is the objective the agent
is seeking to obtain, the operators are actions that can be performed to obtain the goal,
methods are sequences of operators that accomplish a goal, and selection rules adjudicate
conflicts between methods. This type of framework has been implemented
effectively in the form of first-order logic rules in production systems, such as CLIPS. Many
of the existing cognitive architectures leverage production systems and GOMS-like rules to
represent domain knowledge and drive behavior (N. A. Taatgen & Anderson, 2008; Sun,
2007b; Laird & Wray III, 2010).
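
The roles of the four GOMS elements can be made concrete with a minimal sketch. It is an illustration only and does not reproduce CLIPS or any particular production system: the `Method` class, the priority-based selection rule, and the movement example are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, object]
Operator = Callable[[State], State]  # an operator maps a state to a new state


@dataclass
class Method:
    name: str
    operators: List[Operator]            # sequence of operators that accomplishes the goal
    applicable: Callable[[State], bool]  # precondition examined by the selection rule
    priority: int = 0                    # used to resolve conflicts between methods


def select_method(methods, state):
    """Selection rule: among applicable methods, choose the highest priority."""
    candidates = [m for m in methods if m.applicable(state)]
    if not candidates:
        raise LookupError("no applicable method")
    return max(candidates, key=lambda m: m.priority)


def pursue_goal(goal, methods, state):
    """Apply the selected method's operators in order unless the goal already holds."""
    if goal(state):
        return state
    for operator in select_method(methods, state).operators:
        state = operator(state)
    return state


# Hypothetical example: reach position 3 on foot or by vehicle.
def step(state):
    return {**state, "position": state["position"] + 1}


def drive(state):
    return {**state, "position": state["position"] + 3}


def at_objective(state):
    return state["position"] >= 3


METHODS = [
    Method("walk", [step, step, step], applicable=lambda s: True, priority=0),
    Method("drive", [drive], applicable=lambda s: s["has_vehicle"], priority=1),
]
```

Calling `pursue_goal(at_objective, METHODS, {"position": 0, "has_vehicle": True})` selects the drive method, while the same call without a vehicle falls back to walking.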

3. Review of Cognitive Architectures 

This section reviews three of the more prevalent cognitive architectures in broad
use: Adaptive Control of Thought-Rational (ACT-R), State, Operator, and Result (SOAR), and
Connectionist Learning with Adaptive Rule Induction On-line (CLARION).

a. Adaptive Control of Thought-Rational

ACT-R, whose development started in 1983, has traditionally focused on
serving as a platform for research on cognition and representation of fundamental
psychological processes, see Figure 7 (Zacharias et al., 2008). ACT-R uses a combined form of
symbolic and numerical representation, with production rules firing based on the log odds of
success of a particular rule in a given situation, its activation level (Sun, 2007b). ACT-R
deals with the notion of working memory capacity through its use of declarative memory,
which decays as the amount of information pushed in increases. In this manner, ACT-R
constrains the agent’s information processing ability based on a capacity-based
representation of working memory (N. A. Taatgen & Anderson, 2008). ACT-R has been used in
a number of applied settings, including the modeling of adversarial behavior (Zacharias et
al., 2008).

Figure 7: ACT-R top-level conceptual diagram.


ACT-R imposes the serial bottleneck in central processing by allowing only
one rule to fire at a time in its central production system and by limiting the number of
cognitive steps that can be taken in a given decision cycle (N. A. Taatgen & Anderson, 2008).
ACT-R further constrains its production system by limiting matches to items currently in its
buffers, implying that it cannot match against all items in its declarative memory and must
instead retrieve and examine one item at a time. Learning occurs in ACT-R by combining
rules over time into new sets of behavior that are commonly associated with each other,
transforming general knowledge into task-specific knowledge (N. A. Taatgen &
Anderson, 2008).

b. State, Operator, and Result

The SOAR architecture has gone through eight major versions between 
1982 and 2007, all maintaining a pure symbolic processing approach and using production 
system rules for long-term memory (Laird & Wray III, 2010; Laird, 2008). The traditional 
SOAR, up through SOAR 8, used symbolic long-term memory, representing knowledge 
as a set of production system rules, and short-term memory, which maintains a current
assessment of the situation based on perception of the current state informed by long-term
memory. 

SOAR is characterized as taking a functional approach to the representation 
of working memory capacity by Taatgen and Anderson, where limits on working memory 
serve to separate relevant from irrelevant information in the given task context (N. A. Taatgen
& Anderson, 2008). The general processing cycle in SOAR is to receive an input from
perception into short-term memory, update the agent’s goal based on the new information,
and evaluate and select operators to achieve the goal, much like a GOMS framework, see
Figure 8. Fixed decision procedures then select the appropriate operator, with arbitrary
conflict resolution mechanisms in place. The actions associated with the chosen operator
are executed by the rule-based system and the output passed to the environment for
execution (Laird & Wray III, 2010; Laird, 2008). 

SOAR is capable of learning new rules from direct instruction and has been used
in a variety of applications ranging from expert systems to the control of autonomous agents
(N. A. Taatgen & Anderson, 2008). SOAR has been used in military simulations to replace 
human role players and by the Institute for Creative Technologies (ICT) at the University 
of Southern California to control virtual characters for game-based training applications 
(Zacharias et al., 2008). Laird highlights extensions to the traditional SOAR in the latest 
version, SOAR 9, to provide capability for long-term memory representation, additional
learning mechanisms, and non-symbolic processing (Laird & Wray III, 2010; Laird, 2008;
Wray & Jones, 2006). 

Figure 8: SOAR top-level conceptual diagram.


Laird states that the goal of SOAR 9 was to retain the strengths of the
original SOAR while expanding SOAR to more closely resemble human capabilities (Laird &
Wray III, 2010; Laird, 2008). The new version of SOAR incorporates a notion of
activation into its representation of working memory, inspired by ACT-R. RL is integrated into
the selection of operators using an $\epsilon$-greedy algorithm. The concept of emotion is
represented in the current version of SOAR using appraisal theory. Appraisals lead to emotions,
eventually being expressed as an intensity of emotion, which is used in conjunction with
RL as part of the reward structure. Declarative knowledge can be built up from information
that first moves into working memory. Episodic memory, task-independent information
that can be used to aid in reasoning across tasks, contains memory of experience over time,
providing an idea of context. This is believed to be a key enabling capability for mental
simulation and prediction, two functions believed to enable decision making under
conditions of uncertainty.
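
The operator selection rule mentioned above can be sketched generically as follows. This illustrates the standard rule of exploring with a fixed probability and otherwise acting greedily; it is not SOAR’s actual implementation, and the function name and arguments are hypothetical.

```python
import random


def epsilon_greedy(operator_values, epsilon, rng=random):
    """Return the index of the selected operator.

    With probability epsilon a uniformly random operator is chosen (exploration);
    otherwise the operator with the highest learned value is chosen (exploitation).
    """
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must lie in [0, 1]")
    if rng.random() < epsilon:
        return rng.randrange(len(operator_values))
    return max(range(len(operator_values)), key=operator_values.__getitem__)
```

Setting epsilon to 0 gives a purely greedy policy, and decaying epsilon over time reduces exploration as the value estimates improve.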

c. Connectionist Learning with Adaptive Rule Induction On-line 

CLARION is a cognitive architecture consisting of four subsystems: the
action-centered subsystem, the non-action centered subsystem, the motivational
subsystem, and the meta-cognitive subsystem, see Figure 9. Each subsystem provides two levels
of knowledge representation, a top level for explicit knowledge representation and a bottom
level for implicit knowledge representation. These bottom-up associations between action,
state and outcome inform action selection (N. A. Taatgen & Anderson, 2008). Interaction
occurs between the two levels during action selection and learning. The action-centered
subsystem controls all actions, external to the agent and internal. The non-action centered
subsystem stores and maintains general knowledge. The motivational subsystem
determines motivations for perception, action and cognition. The meta-cognitive subsystem
controls the system of systems, providing central control (Sun, Zhang, & Mathews, 2006).

Figure 9: CLARION top-level conceptual diagram.

CLARION uses rule systems implemented in a neural network to constrain
the computational power of the system and focus on more local situations (N. A. Taatgen &
Anderson, 2008). CLARION has mechanisms for implicit and explicit learning (N. A. Taatgen
& Anderson, 2008). CLARION is described by Sun as a hybrid connectionist-symbolic
system and has been applied to modeling organizational decision-making, with CLARION-based
agents interacting as part of an organization, and in an analysis of survival strategies
of tribal societies. The authors conclude with the need to explore the implications of
individual cognition on political systems, justice and individual motivating factors (Sun &
Naveh, 2007).

4. Learning in Cognitive Architectures 

The degree of learning to include in agents depends on the purpose of the modeling
effort and the level of sophistication in the agents. The general problem is to allow agents
to learn which action choices lead to desirable outcomes within the agent environment.
This has advantages over trying to script every possible state an agent might encounter,
which will be inherently brittle. Learning in cognitive architectures intended for social
simulation can be further characterized by the objective of modeling individual and group
human behavior as realistically as possible. Multiple approaches to the problem exist with
various advantages and disadvantages (Brenner, 2006; Sutton & Barto, 1998). The use of
learning techniques facilitates the development of adaptive agents that update their
behaviors based on information received over the course of a simulation run. This capability
enables dynamic communications patterns, the changing of affiliations over time, and
behavior changes in response to situations in the environment. The use of learning agents can
also potentially enable verification of the dynamics of the environment.

ACT-R implements learning in the determination of activation levels in its
declarative memory. The activation level of an item in declarative memory is linked to the
frequency of its use, with decays in activation levels being experienced over periods of
inactivity, in line conceptually with reinforcement learning. The activation level represents
the probability that the item in memory will be needed given a certain state of the
environment. Production rules retrieve these chunks from memory through a buffer, with the
activation level linked to the retrieval time. Each production rule has an associated
utility value, continually updated by learning algorithms, based on the notion of an estimated
cost and probability of reaching the goal state given the selection of that rule. ACT-R also
adds items to declarative memory and constructs new production rules, similar to SOAR,
by combining associated rules (rules fired sequentially) into a single rule (N. Taatgen et al.,
2006; Wray & Jones, 2006). ACT-R and SOAR contrast with CLARION in their approach
to learning (Sun, 2007b).
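
The link between frequency of use, decay, and retrieval time can be illustrated with the standard ACT-R base-level learning and retrieval latency equations, which are not stated in the text above and are brought in here as an assumption. The function names are hypothetical, and the decay of 0.5 and latency factor of 1.0 are illustrative values.

```python
import math


def base_level_activation(use_times, now, decay=0.5):
    """Base-level activation: ln of the sum over past uses of (now - t_j) ** -decay.

    More frequent and more recent uses raise activation; inactivity lowers it.
    """
    ages = [now - t for t in use_times if t < now]
    if not ages:
        raise ValueError("at least one past use is required")
    return math.log(sum(age ** -decay for age in ages))


def retrieval_latency(activation, latency_factor=1.0):
    """Retrieval time falls exponentially as activation rises."""
    return latency_factor * math.exp(-activation)
```

An item used three times recently has a higher activation, and therefore a shorter retrieval latency, than an item used once long ago.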

CLARION combines a bottom-up and top-down approach to learning, with agents
that use model-free methods from RL, such as Q-learning, to determine action choices
from the bottom up, under the supervision of top-down rules and chunks similar to those
found in ACT-R or SOAR. The output of both bottom-up and top-down processes is an
action choice. Feedback for these processes comes to the agent in the form of an internal
and an external signal. Internal feedback exists in the form of motivation; external feedback
must be specified by the modeler. Learning and action choices reside in the action-centered
subsystem, with goals being specified by the motivational subsystem and the meta-cognitive
subsystem. The meta-cognitive subsystem provides the agent with information regarding
its own internal state based on internal feedback from the motivational subsystem (Sun,
2007b). The use of motivation and its impact on goal setting is consistent with the updated
view of Maslow’s hierarchy of needs.
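
The combination of bottom-up and top-down action selection can be sketched in a highly simplified form. This is not CLARION’s integration mechanism: it assumes that an explicit rule, when one matches the state, simply overrides the implicit choice, and the class name, parameters, and tabular Q-learning update are illustrative assumptions.

```python
from collections import defaultdict


class TwoLevelAgent:
    """Toy agent with explicit top-level rules over a bottom-level Q-learner."""

    def __init__(self, actions, rules=None, alpha=0.1, gamma=0.9):
        self.actions = list(actions)
        self.rules = dict(rules or {})  # top level: explicit state -> action rules
        self.q = defaultdict(float)     # bottom level: implicit Q(s, a) estimates
        self.alpha = alpha              # learning rate
        self.gamma = gamma              # discount factor

    def choose(self, state):
        """Use an explicit rule if one matches; otherwise act greedily on Q."""
        if state in self.rules:
            return self.rules[state]
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def learn(self, state, action, reward, next_state):
        """Standard one-step Q-learning update of the bottom level."""
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        td_error = reward + self.gamma * best_next - self.q[(state, action)]
        self.q[(state, action)] += self.alpha * td_error
```

In this sketch the external feedback is the `reward` argument supplied by the modeler; an internal motivational signal could be added to it, but how CLARION weighs the two is not represented here.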

5. Agent Based Models, Cognitive Modeling, and Complexity 

Agent-based modeling is defined by the National Research Council as the
computational study of systems that are complex in the following sense: (1) the systems are
composed of multiple interacting entities and (2) the systems exhibit emergent properties, that
is, properties arising from entity interactions that cannot be deduced simply by averaging or
summing the properties of the entities themselves (Zacharias et al., 2008). Multi-agent
systems, as defined by Shoham and Leyton-Brown, are those systems that include
multiple autonomous entities with either diverging information or diverging interests, or both
(Shoham & Leyton-Brown, 2009). Ferber refers to multi-agent systems as societies of
agents that interact and coordinate their behaviors toward some common goal (Drogoul 
& Ferber, 1994; Ferber, Gutknecht, & Michel, 2004). Ferber goes further to distinguish 
agent centered multi-agent systems (ACMAS), which he refers to as the classical view of 
MAS, from what he characterizes as organization centered multi-agent systems (OCMAS). 
Where ACMAS was centered on cognitive states of the agents being modeled, OCMAS 
focuses instead on roles, groups, tasks and interaction protocols. Gilbert, describing agents
from a MAS point of view attributed to Wooldridge and Jennings, states that agents “are
processes implemented on a computer that have autonomy (they control their own actions);
social ability (they interact with other agents through some kind of language); reactivity
(they can perceive their environment and respond to it); and pro-activity (they are able
to undertake goal-directed actions)” (Gilbert, 2008). The relationship between agent based
models and multi-agent systems is loosely defined at best. Both leverage similar concepts
and techniques, so the primary difference appears to rest in the use case.

Agent based modeling and multi-agent systems share the idea of intelligent
agents that use some level of information from the environment in conjunction with
rule sets or algorithms of varying complexity to select actions that allow them to interact
with their environment and other agents. The line between these two areas is not distinct,
with the clearest distinction possibly being formed by intended use and intended environment.
From this perspective, their approach is consistent with the definition of an agent
from the field of artificial intelligence as stated by Russell and Norvig: anything that can
be viewed as perceiving its environment through sensors and acting upon that environment
through actuators (Russell & Norvig, 2010). Both fields focus on the use of software agents
as members of complex systems.

The discipline of cognitive modeling seeks to represent human cognition within
software agents for a variety of uses through the development and application of cognitive
models and cognitive architectures (Zacharias et al., 2008). Cognitive architectures,
described in some contexts as micro-level formal models, are simulation based models of
human information processing often built to emphasize distinct aspects of cognition, based
on the use case. Agent based social simulations represent human cognition at varying levels
of sophistication (Zacharias et al., 2008), but typically adhere to the most rudimentary
level of an agent, a reflex agent, as defined by Russell and Norvig and reviewed previously.
Summarized here, an agent senses information, or percepts, from its environment using
sensors, updates its internal representation of the world, and selects actions based on this
updated internal state (Russell & Norvig, 2010). Depending on the needed level of resolution,
the agent can represent either an individual or a group of individuals, though the use
of cognitive architectures in the representation of group cognition is not well developed
(Zacharias et al., 2008). Sun points out that agent based social simulations and cognitive
architectures have developed in relative isolation from each other, but that the use of appropriate
cognitive architectures could benefit agent based social simulation by providing
a realistic basis for the representation of individual agents. Sun further identifies cognitive
social simulation as a path forward for the intentional combination of these two fields
(Sun et al., 2006). While the potential exists for agent based social
simulations and cognitive architectures to enable a multi-level examination of human behavior,
including the sociological and psychological perspectives respectively (Sun,
2007b), the National Research Council is less clear on the use of cognitive architectures to
represent group cognition (Zacharias et al., 2008).
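
The sense, update, and act cycle summarized in this paragraph can be sketched in a few
lines. This is a minimal illustration rather than an implementation of any architecture
cited here; the class name SimpleAgent, the rule format, and the "threat" percept are
hypothetical.

```python
class SimpleAgent:
    """A rudimentary rule-driven agent with a small internal state."""

    def __init__(self, rules, default_action="wait"):
        # rules: ordered list of (condition, action) pairs, where condition
        # is a function of the agent's internal state returning True/False.
        self.rules = rules
        self.state = {}
        self.default_action = default_action

    def sense(self, environment):
        # Sensors: here the agent simply copies what the environment exposes.
        return dict(environment)

    def update(self, percepts):
        # Fold the new percepts into the internal representation of the world.
        self.state.update(percepts)

    def select_action(self):
        # Fire the first rule whose condition matches the internal state.
        for condition, action in self.rules:
            if condition(self.state):
                return action
        return self.default_action

    def step(self, environment):
        self.update(self.sense(environment))
        return self.select_action()


# Hypothetical rule set: take cover whenever a threat is perceived.
agent = SimpleAgent(rules=[(lambda s: s.get("threat", False), "take_cover")])
first_action = agent.step({"threat": True})
second_action = agent.step({"threat": False})
```

A cognitive architecture would replace the flat rule list with structured memory, learning,
and action selection components, while keeping the same outer loop.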

A cognitive architecture provides the specification for those features of an intelligent
agent or system, depending on the use case, that are invariant over time and across
applications (Langley et al., 2009). The development of cognitive architectures, fixed
infrastructures to support the acquisition and use of knowledge, springs from the pursuit of
generally intelligent entities as opposed to specialized algorithms. Generally intelligent
entities are distinguished by their ability to be applied to a wide variety of tasks, using
diverse knowledge and experience in complex environments (Laird, 2008). Cognitive
architectures differ from cognitive models: cognitive models can be thought of as more narrowly
scoped micro models, while cognitive architectures generally seek to implement a unified theory of
cognition (Zacharias et al., 2008).

Sun defines a cognitive architecture broadly as a “domain generic computational
model, capturing the essential structure and process of the individual mind” (Sun et al.,
2006). Taatgen and Anderson describe cognitive architectures as “on the one hand echoes
of the original goal of creating an intelligent machine faithful to human intelligence, and
on the other hand attempts at theoretical unification in the field of cognitive psychology”
(N. A. Taatgen & Anderson, 2008). Multiple cognitive architectures exist and have been
applied to a variety of use cases, but the common components of a cognitive architecture,
as identified by Langley et al. and Laird, include short and long-term memory, a language
to represent elements stored in memory, and functional processes that leverage these structures.
This list is consistent with Wickens' human information processing view (Laird, 2008;
Wickens & Hollands, 2000). The National Research Council expands the list of cognitive
and perceptual processes to include: attention, situational assessment, goal management,
planning, meta-cognition, learning, action selection, and memory (Zacharias et al., 2008).
Psychologically oriented cognitive architectures are important because they can facilitate
the understanding of human cognition, can inform understanding of societal-level collective
behavior, and are relatively realistic and human-like (Sun, 2007b). Cognitive architectures
within social simulation serve to add structure and specificity to the key component
of these models, the individual actors' responses that provide the link between the micro
and macro-levels.

Agent based models and multi-agent systems focus on the use of intelligent agents
within simulation for a variety of purposes. The level of sophistication the agents employ
varies by use case in these settings, as does the need to represent human-like cognition
and behavior. Cognitive architectures focus on the representation of human cognition and
behavior by seeking to replicate the functions of the human information processing system.
Sun proposes the creation of cognitive social simulations, which combine these disciplines,
allowing for more detailed representation of human behavior at the micro-level upon which
agent based modeling's macro-level analysis rests (Sun et al., 2006). The efficacy of this
approach has yet to be fully explored in the literature, but holds promise as a means to
provide sophisticated behavior within agents while maintaining traceability to conceptual
models required for validation. The link between the micro and macro system behavior
level will be expanded below.

The study of complex social systems goes back hundreds of years, with Adam Smith's
The Wealth of Nations (1776) being one of the earliest discussions on the topic (Smith,
1966). Smith's invisible hand guides self-interested rational agents into well formed market
structures (Miller & Page, 2007). The traditional tools of social science often constrained
the modeling of social systems. Tools developed in the study of complex systems, such
as agent based modeling, have been identified and come into use by researchers studying
complex social systems (Tesfatsion & Judd, 2006). The relation between complex systems
research and complex social systems is especially relevant to understanding the transition
of a society from a relatively unstable state, such as during an open insurgency, to the more
stable post conflict state, with an inactive or non-existent insurgency. Understanding the
perceptions, motivations, and intentions of the diverse actors at play in a conflict ecosystem
is essential to gaining an understanding of how the social system can be transitioned from
the undesirable chaotic state of insurgency to the more stable and desirable post conflict
state (Mansoor, 2007). Stable, in this context, is certainly relative to the given social system
and its concept of normal.

Miller and Page state that complexity arises in a system when dependencies among
the elements of the system become important to the normal behavior of the system to such
an extent that the removal of one element from the system alters the system's behavior to
an extent much greater than intuitively expected given the relative perceived contribution of
the removed element (Miller & Page, 2007). These complex systems provide challenges to
those seeking to understand their responses under varying conditions. Social systems tend
toward complexity, with multiple intelligent social agents interacting through a variety of
processes and connections. Systems such as these are non-linear and do not lend themselves
to analysis by decomposition, making holistic approaches more appropriate for system-level
analysis (Miller & Page, 2007).

The characteristics that make agent based simulation an attractive alternative for the
analysis of complex systems also make it a viable alternative for use in examining complex
social systems. The application of agent based models to this domain suffers from
the same limitations as their application within the field of complex systems. In addition
to these limitations, the use of these tools in the representation of social systems requires
the representation, at some level, of human behavior. A lynchpin in the future validation of
cognitive social simulation, at the macro-level, is the validation of the cognitive architecture
used, at the micro-level.

The connection between agent-based models and the social sciences is illustrated
by the field of economics, which models economic behavior as the result of rational actors,
with ties to models based on individual actors from its inception. Epstein highlights this
relationship, pointing out that certain social systems, such as trade networks, are distributed,
asynchronous, decentralized and dynamic (Epstein, 2006). When market prices reach
equilibrium the society has executed a calculation and converged on a solution. Convergence
to social norms is equivalent to convergence to strategy distributions in n-person
games or to an optimal policy when mapping state-action pairs in reinforcement learning
(Epstein, 2006). The first applications of simulation to the social sciences coincided
with the first uses of computers in university research settings in the 1960s (Gilbert, 2008).
Many of these early efforts focused on modeling for prediction rather than understanding,
which is typically the goal of most social science research. While these early efforts met with mixed
results, and served to generate healthy skepticism of simulation, an approach which gained
traction was the idea of micro-simulation. Micro-simulation modeling represents individual
members of a population, based on polling or other data sources, and follows them
forward through time through a series of transition probabilities. This technique has been
successfully applied by several European nations (Gilbert, 2008).
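
The mechanics of micro-simulation described above can be sketched as follows. The two
states, the transition probabilities, and the population mix are invented for illustration
and are not drawn from any of the cited studies; in practice they would be estimated from
polling or other data sources.

```python
import random

# Hypothetical per-period transition probabilities between two states.
TRANSITIONS = {
    "employed": {"employed": 0.9, "unemployed": 0.1},
    "unemployed": {"employed": 0.3, "unemployed": 0.7},
}


def step_individual(state, rng):
    """Draw one individual's next state from that state's transition row."""
    outcomes = list(TRANSITIONS[state])
    weights = [TRANSITIONS[state][o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def simulate(population, periods, seed=0):
    """Follow every individual forward in time; return one snapshot per period."""
    rng = random.Random(seed)
    history = [list(population)]
    for _ in range(periods):
        population = [step_individual(s, rng) for s in population]
        history.append(list(population))
    return history


# 100 simulated individuals followed over five periods.
history = simulate(["employed"] * 80 + ["unemployed"] * 20, periods=5)
employed_share = [snap.count("employed") / len(snap) for snap in history]
```

Note that individuals in this sketch do not interact; each follows its own transition
probabilities independently of the others.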

An interesting aspect of agent models that Epstein devotes some time to is their ability
to allow multiple social theories, which he refers to as social computations, to be active
at once in an experimental framework. Epstein, a proponent of the use of these models by
social scientists, proposes the use of agent based social simulations to facilitate interdisciplinary
social science work. In general, Epstein adheres to what he refers to as a generative
approach to social science (Epstein, 2006). This approach relies on the use of agent based
models to facilitate the representation of individual actors with simple rule sets, what he
refers to as micro-level behaviors. These simple rule sets represent the simplest possible
representation of agent cognition. This generative approach proceeds with experimentation
with the agent based social simulation with micro-level specifications, seeking to find a set
of such specifications that produces macro-level output data that matches some historical
data set or observed social phenomena of interest.

This micro to macro-level mapping allows for the interaction of behavioral and social
sciences to some degree. The successful identification of such a set of micro-level
behaviors leaves the analyst or social scientist with a screened hypothesis that has not been
rejected given the conditions of the experiment. This is a critical point to make regarding
this approach. In order to accept that the micro-level behaviors constitute a potential
explanatory hypothesis describing the system under study, the modeler also has to accept
that the interaction rules of the system and the data that have been used to populate the
model are accurate and valid. In some domains, this may be a minor requirement, but
in the application of these techniques to human behavior this serves as a major limitation
of the generative approach. This also highlights the need for the integration of individual
and group cognitive architectures into agent based social simulation in a more robust manner.
The use of psychologically oriented cognitive architectures at the individual-level in
this domain can provide structured integration between Epstein's micro-level specification
from the field of psychology and understanding of societal-level collective behavior (Sun,
2007b).

A needed area of research is the exploration of group cognitive architectures based
on social psychology. Architectures based on social psychology would allow the researcher
to change the level of analysis while preserving the benefits of the use of cognitive architectures
as a formal specification. Cognitive architectures provide a detailed specification
of the framework of human cognition, including realistic constraints, capabilities and tendencies
based on cognitive processes, but possess validation concerns as well (Sun, 2006).

Epstein as well as Miller and Page describe the use of agent based social simulations
to explore the generation of equilibrium conditions from societies of boundedly rational
agents in decentralized locations (Epstein, 2006; Miller & Page, 2007). Gilbert adds to
this, highlighting the fact that societies are made up of individual actors with dynamic,
nonlinear interactions where knowledge and materials are exchanged that impact the future
behavior of the individual actors (Gilbert, 2004). The dynamic nature of the system makes
it impossible to study the society as a whole by examining individual members in turn,
since macro-level system behavior emerges from the interaction of individually motivated
actors on a dynamic landscape (Gilbert, 2004). Sun proposes the use of cognitive social
simulation to explore these interactions at multiple levels, with psychologically based cognitive
architectures embedded in agents within social simulation (Sun, 2006). Borrill and
Tesfatsion propose that agent based models provide a local observer view of the world as
opposed to a God's-eye view, and that this local observer view provides great insight for the
social sciences, which highlights the importance of the cognitive architecture specifications within
social simulation. They provide two examples of the application of agent based models to
the social sciences: one in the analysis of power consumption in the U.S. and the second in
the field of information storage design (Borrill & Tesfatsion, 2010).

The application of agent based models to the social sciences has focused on the use
of simple behaviors at the individual agent-level to generate macro-level social phenomena
as described above. Common phenomena explored include segregation, trust formation,
and resource constrained games (Schelling, 1972). The use of these models in conjunction
with cognitive architectures has not been fully explored, but is required to fully represent
complex phenomena such as belief revision, changing affiliations, coercion and persuasion,
and the formation of trust. The following section will discuss emergent behavior,
a term often used to refer to the micro to macro link, in greater detail (Zacharias et al.,
2008).

Any discussion of agent based social simulations or agent based models will eventually
touch on the topic of emergence (Zacharias et al., 2008). Emergence, the idea that
aggregate behavior arises from the interaction of local individual behaviors, provides one
of the more contentious points of discussion in the area of complex systems and complex
social systems (Epstein, 2006; Gilbert, 2004). The description of emergence most often
used describes emergence as the result of individual, localized behavior aggregated into
a global behavior disconnected from its origins (Miller & Page, 2007). This implies that
the macro-level system behavior is robust to changes in the micro-level system behavior
as specified in what traditional agent based model research refers to as micro-level behavior
and cognitive social simulation specifies in the cognitive architecture. Miller and Page
point out that emergence can occur at many levels within the system of systems, providing
natural breakpoints in some cases for distinguishing sub-systems that might not previously
have been apparent. Epstein concurs with this assessment by characterizing emergence
as phenomena relative to the current state of knowledge describing the systems of interest
(Epstein, 2006). In this view, the definition of emergent behavior for a system under study
relates to those unexpected events or conditions that arise from micro-level behavioral
interactions. Sun refers to this as the micro-macro link and emphasizes the central role
of cognitive architectures in this phenomenon (Sun, 2006). When emergent macro-level
behaviors are observed that correspond to phenomena, social or otherwise, observed in the
real-world system, a potential explanatory hypothesis at the micro-level has been identified
(Epstein, 2006). Those who subscribe to this view are clear that this is not the end of the
analysis, but serves to focus further empirical work with the system to accept or reject this
new hypothesis resulting from the observed emergent behavior (Epstein, 2006).

Miller and Page use the Law of Large Numbers and the Central Limit Theorem as
examples of theories that describe how individual behaviors generate aggregate patterns
that are robust against variation within limits (Miller & Page, 2007). The authors attribute
the effectiveness of these two theorems to the concept of disorganized complexity. The key
feature for this discussion is that the interactions of local entities tend to average out as an
increasing number of independent actors are added to the system. The system eventually
reaches a point where system-level predictions are possible through the above stated theorems,
but individual-level forecasts are still elusive. Gilbert concurs with this assessment
of emergence in social systems using the example that while we can identify the mission of
a political organization, the identity does not always transfer to the individuals that belong
to the organization (Gilbert, 2004).
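
The averaging-out behavior attributed to disorganized complexity can be illustrated
numerically. The sketch below assumes each actor's contribution is an independent draw
from a uniform distribution on [0, 1); the function name and the trial counts are
illustrative choices, not taken from Miller and Page.

```python
import random
import statistics


def aggregate_spread(n_actors, n_trials=200, seed=1):
    """Estimate how much the population average varies from trial to trial.

    Each trial draws n_actors independent behaviors and records their mean;
    the function returns the standard deviation of those means.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(n_trials):
        behaviors = [rng.random() for _ in range(n_actors)]
        means.append(sum(behaviors) / n_actors)
    return statistics.pstdev(means)


# The aggregate becomes more predictable as independent actors are added,
# even though any single actor's draw remains as unpredictable as before.
spread_small = aggregate_spread(10)
spread_large = aggregate_spread(1000)
```

Under the independence assumption, the spread of the average shrinks roughly in proportion
to one over the square root of the number of actors, which is why system-level forecasts
become feasible while individual-level forecasts do not.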

Miller and Page coin the term organized complexity in referring to those properties of
complex systems that arise from the interaction of intentional agents in systems that provide
feedback. In these types of systems, variations no longer average out, but rather become
reinforcing, causing the system to exhibit macro-level behaviors that would be termed emergent
in this domain (Miller & Page, 2007). From this viewpoint, emergent macro behavior
could be said to arise in systems with organized complexity as defined by Miller and Page,
resulting from the central role of the cognitive architecture specification in the case of cognitive
social simulation. This also meets the definition of emergence from Gilbert, which
defines emergence as any phenomenon arising from a system that requires new categories in
its description that are not necessary for the description of the system components (Gilbert,
2008). In one of the earliest uses of social simulation, Schelling used agent
based models to study tipping points, particularly in regard to racial residential segregation
processes, providing exemplars of this micro to macro link (Schelling, 1972).
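
A compact sketch of a Schelling-style segregation model shows how a simple micro-level
rule with feedback can be explored. The grid size, vacancy rate, tolerance threshold,
wrap-around neighborhood, and random relocation rule are all assumptions chosen for
brevity and are not a reproduction of Schelling's original specification.

```python
import random


def neighbors(grid, r, c):
    """Return the occupants of the eight surrounding cells (grid wraps around)."""
    n = len(grid)
    cells = [grid[(r + dr) % n][(c + dc) % n]
             for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]
    return [x for x in cells if x is not None]


def similarity(grid, r, c):
    """Fraction of occupied neighboring cells that share the agent's type."""
    occupied = neighbors(grid, r, c)
    if not occupied:
        return 1.0
    return sum(1 for x in occupied if x == grid[r][c]) / len(occupied)


def mean_similarity(grid):
    n = len(grid)
    vals = [similarity(grid, r, c)
            for r in range(n) for c in range(n) if grid[r][c] is not None]
    return sum(vals) / len(vals)


def run_schelling(size=10, vacancy=0.2, threshold=0.5, sweeps=20, seed=0):
    """Run the model; vacancy must leave at least one empty cell."""
    rng = random.Random(seed)
    n_empty = int(size * size * vacancy)
    n_agents = size * size - n_empty
    cells = (["A"] * (n_agents // 2) + ["B"] * (n_agents - n_agents // 2)
             + [None] * n_empty)
    rng.shuffle(cells)
    grid = [cells[i * size:(i + 1) * size] for i in range(size)]
    before = mean_similarity(grid)
    for _ in range(sweeps):
        for r in range(size):
            for c in range(size):
                # Micro-level rule: an agent with too few like neighbors
                # relocates to a randomly chosen empty cell.
                if grid[r][c] is not None and similarity(grid, r, c) < threshold:
                    empties = [(i, j) for i in range(size)
                               for j in range(size) if grid[i][j] is None]
                    i, j = rng.choice(empties)
                    grid[i][j], grid[r][c] = grid[r][c], None
    return grid, before, mean_similarity(grid)


final_grid, similarity_before, similarity_after = run_schelling()
```

Comparing similarity_before with similarity_after across several seeds and thresholds is
one way to look for the tipping behavior discussed above; each relocation changes the
neighborhoods that other agents evaluate, which is the feedback that distinguishes this
setting from the independent-actor case.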

Gilbert provides several examples of what he terms emergent behavior from real
social systems. He uses as examples phenomena such as segregation into neighborhoods,
the evolution of language, and migration. He points out that in social systems the
actors can actually observe and identify emergent behavior and in some cases adjust their
behavior to what they perceive as the emerging trend, highlighting the importance of
cognitive architectures (Gilbert, 2004). This corresponds to Miller and Page's view of systems
that exhibit organized complexity, with micro-level behaviors serving to reinforce
emerging macro-level trends (Miller & Page, 2007). These cognitively aware actors reason
and make choices at the individual-level, with results apparent at the macro-level.
Castelfranchi, quoted by Sun, sums up the problem nicely: “The real problem is modeling how we
play our social roles, while being unaware of the functional effects of our actions, not only
with our routine actions but even when doing something deliberately for our own subjective
motives” (Sun, 2006).

Cognitive architectures play a central role in the generation of emergent behavior in
complex social systems (Sun, 2006). Emergent behavior at the macro-level stems from the
interaction of micro-level behaviors at the individual agent-level, as described above.
Psychologically based cognitive architectures provide structure that encapsulates the cognitive
functions of the agent and can lead to cognitive emergence, or bottom-up learning, as
described by Castelfranchi (2001) and referenced by Sun (Sun, 2006). At a coarse level, these
functions include perception, long and short term memory, and action selection. Since it is
the resulting action selection from each individual that in the aggregate leads to emergent
behavior, cognitive architectures, the action selection component in particular, implicitly
influence macro-level behavior generation at the societal-level. The use of cognitive architectures
provides structure to the representation of human behavior and requires detailed
specification, enabling greater documentation and potentially aiding in the exploration of
the micro to macro link and validation efforts for this class of models (Sun, 2006). The
need to represent micro-level behaviors in order to gain an understanding of the system
under study, rather than the individual actors, leads to a discussion of the need for a holistic
approach to system analysis.

Agent based models are tools for holistic analysis of systems, but require a reductionist
approach in the development of micro-level behaviors, a minimal cognitive architecture,
for individual actors. Agents are intended to represent human behavior in the most
simplified manner that is still useful (Gilbert, 2008). The application of the reductionist
approach to agent based modeling simply assists in the identification of a logical starting
point for what Epstein calls the generative approach (Epstein, 2006). Discussion of the
reasons that a pure reductionist approach to complex social systems is inappropriate is provided
by Miller and Page and Epstein at some length (Epstein, 2006; Miller & Page, 2007).
The arguments presented by both groups tend to center around the existence of emergent
behavior in complex systems that cannot be easily explained by simple system decomposition
as was discussed in the previous section. The authors use the partitioning of economics
into micro and macro economics and the partition between the behavioral and social sciences
as examples from well studied fields that recognize that the reductionist view does
not fully account for system-level behaviors (Epstein, 2006; Gilbert, 2008; Miller & Page,
2007).

Agent based models provide a computational tool for use in creating a holistic view
of the system under study. The utility of computational tools is evaluated no differently
from that of theoretical or mathematical tools: they should be judged on their ability
to enhance science and simplify a task. Theories by contrast should be judged on their
ability to improve understanding of some phenomenon of interest (Miller & Page, 2007).
Computational tools force the modeler to be precise, whereas theories expressed in natural 
language can be open for interpretation (Gilbert, 2008). When modeling the system under 
study from this view point, the reductionist method of decomposition of the system to 
its constituent parts provides only a starting point to the development of a system-level 
model. The bottom up placement of these constituents into an environment that facilitates 
dynamic interaction representative of the real system completes the holistic view required 
to recognize system-level behaviors. 

In a similar manner, cognitive architectures provide a holistic view of human information
processing and decision making. The link between cognition, or micro specifications,
and macro-level social phenomena can be more fully explored through cognitive
social simulations. Starting with relatively simple cognitive models, of the type typically
proposed by Epstein, researchers can iteratively add complexity to the cognitive representation
(Sun, 2006). Cognitive architectures consist of a system of systems replicating the human
information processing system. The cognitive architecture itself takes a holistic view
of human cognition. The use of group cognitive architectures based on social psychology
could enable this same notion of multi-level modeling with cognitive social simulations.

C. A SAMPLE OF DEPARTMENT OF DEFENSE SIMULATION MODELS 

This section provides a brief overview of DoD models and simulations identified
as candidates for the application of reinforcement learning and/or the inclusion of some
form of cognitive architecture. We provide an introduction to the Combined Arms Analysis
Toolkit for the Twenty-first Century (COMBATXXI), the Infantry Warrior Simulation
(IWARS), the Assignment Scheduling Capability for Unmanned Aerial Systems (ASC-U),
and the Cultural Geography model (CG).

1. Combined Arms Analysis Toolkit for the Twenty-first Century 

COMBATXXI is a Joint high-resolution, closed-form, stochastic, analytical combat
simulation developed at TRADOC Analysis Center-White Sands Missile Range (TRAC-WSMR)
and the Marine Corps Combat Development Command, Operations Analysis Division
(MCCDC-OAD) (Kunde, 2005). COMBATXXI is designed to support analysis at
the brigade level and below and supports the representation of light and heavy forces, air mobile
forces, future force capabilities, fixed and rotary wing aircraft, and amphibious and urban
operations. COMBATXXI has been used in a number of studies supporting acquisition
activities, including the Ground Soldier Systems analysis
of alternatives (AoA) and the current Ground Combat Vehicle AoA. COMBATXXI
behaviors are implemented in the Python programming language, allowing great flexibility
in the creation of new behaviors to drive entity actions. Recently, hierarchical
task networks have been used to provide COMBATXXI entities a GOMS-like ability to
achieve goals. A need exists to improve the representation of environmental sensing and
knowledge representation at the individual soldier-level within COMBATXXI and to have
more adaptable agents in general.

2. UrbanSim 

Game-based training simulations provide practice environments that allow trainees
to develop knowledge, skills, and abilities required for the execution of real-world tasks in
a relatively risk-free and cost effective setting, making them especially attractive for DoD
applications. Challenges exist in the measurement of the effectiveness of these systems
in transferring skill proficiency to real-world tasks and in the validation and verification of
the simulation model and its supporting data, though the requirements for verification and
validation for a training use case are less stringent than those imposed on simulation models
designed for analysis. Unique to training systems is a need to verify that the reinforcement
provided by the training system supports the learning objectives of the system. As the
systems become more complex, understanding how the system will respond to all potential
actions selected by a trainee can present challenges due to the size of the potential state
space.


The Army requires the capability to develop adaptive digitized learning products
that employ artificial intelligence and/or digital tutors to tailor learning to
the individual Soldier's experience and knowledge-level and provide a relevant
and rigorous, yet consistent, learning outcome. (U.S. Army 2011)

UrbanSim is a DoD sponsored game-based training simulation, developed at the
Institute for Creative Technologies (ICT), designed to develop
knowledge, skills, and abilities associated with tactical-level decision-making in
counterinsurgency and irregular warfare operations. UrbanSim relies on an underlying
multi-agent simulation, PsychSim, also developed at ICT, to adjudicate the effect of actions
taken by the trainee at each turn and to
provide feedback to the player in the form of a numeric reward signal, graphically displayed
as a horizontal bar chart showing the percent of the maximum score the player has currently
achieved in each of the six categories that make up the aggregate score (max = 600). The
standard training scenario packaged with the product is intended to train students to adopt a
clear, hold, build strategy by selecting the actions for eleven blue units at each
of fifteen game turns. The game developers identified that the verification of all potential
paths through the training system required the use of automated mechanisms (Wansbury,
Hart, Gordon, & Wilkinson, 2010; Wang, Pynadath, & Marsella, 2012).

3. Assignment Scheduling Capability for Unmanned Aerial Systems 

Unmanned systems form an increasingly important component of the surveillance 
and reconnaissance capability of the U.S. and have shown their value in the last decade of 
conflict in a variety of roles (Ahner, Buss, & Ruck, 2006). The efficient allocation of assets 
in theater remains challenging as does the analysis supporting future procurement of this 
class of systems. TRADOC Analysis Center (TRAC) conducted a UAS Mix study in 2006 
and developed an assignment scheduling capability for unmanned aerial vehicles (ASC-U) 
at TRAC-Monterey, in partnership with the MOVES Institute and Rolands and Associates 
Inc. (Nannini, 2006). 

ASC-U employs a discrete event simulation model coupled with the optimization 
of a linear objective function over a finite-time horizon. ASC-U determines a feasible 
schedule for UAV missions that can be successfully executed in a scenario with a specific 
mix of UAVs by obtaining an optimal solution to a simplified problem that assigns available 
UAVs to missions that are available or will be available within a future time horizon at set 
intervals. The need exists to improve the ability of the scheduler agent to develop
near-optimal schedules in cases with delayed and noisy rewards.

4. Cultural Geography Model 

The Cultural Geography (CG) model is a government-owned, open-source, agent-based
social simulation designed to represent the population in a brigade area of operations
in an Irregular Warfare environment, see Figure 10 (J. Alt et al., 2009). CG model agents
operate within a social network, updating their stances on issues of interest as new
information is received through observation or communication. The model is built on a
conceptual model developed through interaction with experts from the social sciences and is
intended to be modular in nature, with an architecture supporting the integration of
behavioral and social modules as needed to support the modeling and analysis of a particular
area of the world during a particular time period (J. Alt et al., 2009). The basic components
of the CG model are the cognitive module, the social network module, and the infrastructure
module. The cognitive module uses Bayesian networks to represent agent internal state, with
agents representing population segments in most use cases.

Figure 10: Cultural geography conceptual model.


The CG model does not specify a cognitive architecture. The social network module
calculates the social distance between agents based on an implementation of the theory of
homophily. The infrastructure module represents essential services as multi-server queues
with finite capacity. Scenarios within the CG model are built around issues in the area
under study that are relevant to the population and the stabilizing force. Data development
follows a process that mirrors the counter-insurgency intelligence preparation of the
battlefield (Mansoor, 2007).

D. SUMMARY 

In this chapter we provided relevant background material and reviewed the state of
the art in reinforcement learning, cognitive architectures, and several currently used DoD
simulation models. In the next chapter, we will present the novel use of an exponentially
weighted average that makes use of continuous time as an action-value estimator in a
reinforcement learning system and document its performance in several benchmark and
game-based environments. In Chapter IV, we will provide an account of the application of
this approach to three different DoD simulation use cases:

• ASC-U 

• COMBATXXI 

• UrbanSim

In Chapter V, we will discuss the use of reinforcement learning and document a
practical cognitive architecture, both applied to emerging DoD efforts in the area of human,
social, cultural and behavioral modeling within the Cultural Geography model.

III. DIRECT-Q COMPUTATION


This chapter details the novel use of an exponentially weighted average that makes
use of continuous environment time as an action-value estimator within a reinforcement
learning agent and improves learning speed compared to the dominant temporal differencing
algorithms designed to address this problem. We develop the algorithm and provide results
of empirical benchmarking in standard benchmark and game-based domains. Emerging
results from the use of an early version of this algorithm in a variety of applications were
previously reported in several conference proceedings, an accepted in-press journal article
(S. Papadopoulos, Alt, Darken, & Baez, 2013), and student thesis work supported by this
effort (J. K. Alt et al., 2011; Ozcan, Alt, & Darken, 2011; Ozkan, 2011; M. Papadopoulos,
2010; Pollock, Alt, & Darken, 2011).

A. DIRECT-Q COMPUTATION 

Delayed rewards are common in RL, making the ability to efficiently assign credit to
(s, a) essential to developing an accurate estimate of Q(s, a). The dominant approaches to
noisy and delayed rewards in RL make use of noisy hill-climbing (temporal differencing).
In this research, we attempt to determine whether a relatively straightforward
and direct approach that uses an intuitive estimator and better accounts for environmental
change can improve performance over the dominant hill-climbing approaches to
this problem. Our approach develops two key ideas:

• Instead of noisy hill-climbing, can we make use of the obvious intuitive estimator,
the historical reward?

• Instead of using the agent's action count, can we make use of the environment time?

DQ-C assigns credit for all future point rewards to each visit to a (s, a), developing
an estimate based on the historical average of the long-term value of each attempt of a
state-action pair. This long-term estimate uniquely makes use of environment time rather than
action counts in the development of the discounted term. These estimates are then used to
calculate either the sample average or the exponentially weighted average value of each (s, a).
DQ-C's unique treatment of time allows it to adapt to changing environments more readily
than similar algorithms that do not account for time and instead use system updates as a
counter. In this section we develop both versions of the proposed approach. We
define the following indices,

• $i$: rewards, $i = \{1 \ldots k\}$, where $k$ is the total number of rewards received,

• $j$: attempts of a (s, a), $j = \{1 \ldots n\}$, where $n$ is the total number of attempts of a
(s, a),

and the following additional terms,

• $r_i$: value of the $i$th point reward,

• $t_i$: arrival time of the $i$th point reward,

• $\tau_j$: selection time of the $j$th attempt of a (s, a),

• $\gamma$: discount factor, $\gamma \in (0, 1]$,

• $\alpha$: learning rate, $\alpha \in [0, 1]$,

• $t$: current simulation time, $t = \{1 \ldots T\}$, where $T$ is the maximum time in the
simulation,

• $H(t_i - \tau_j)$: an indicator function that is equal to 0 if $t_i - \tau_j < 0$, 1 otherwise,

which we will use throughout this section. We define the expected utility of the $j$th
attempt of a (s, a) as the discounted sum of the point rewards, $r_i$, received following the
visit to the (s, a),

$$U_j(s, a) = \sum_{i=1}^{k} r_i \gamma^{t_i - \tau_j} H(t_i - \tau_j), \qquad (25)$$

where $H(t_i - \tau_j)$ is an indicator function that ensures that only point rewards received
following the attempt of a (s, a) are credited to the $j$th attempt, equal to 0 if
$t_i - \tau_j < 0$, or 1 if $t_i - \tau_j \geq 0$. Each future point reward is discounted back to
the time at which the (s, a) was visited. So we see that as $(t_i - \tau_j) \to \infty$ the
contribution of additional point rewards to this term goes to zero (for $\gamma < 1$). Note that
this does not depend on the number of point rewards, $k$, only on the time difference. Next we
estimate the long-term value of each (s, a) by calculating the expected utility of each (s, a)
using either a sample average,


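$$Q(s, a) = \frac{1}{n} \sum_{j=1}^{n} U_j(s, a), \qquad (26)$$

or an exponentially weighted average,

$$Q(s, a) = \frac{\sum_{j=1}^{n} \alpha^{t - \tau_j} U_j(s, a)}{\sum_{j=1}^{n} \alpha^{t - \tau_j}} = \frac{\sum_{j=1}^{n} \alpha^{-\tau_j} U_j(s, a)}{\sum_{j=1}^{n} \alpha^{-\tau_j}}. \qquad (27)$$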
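As an illustration only, Equations 25 to 27 can be evaluated directly from a list of attempt times and a list of timed point rewards. The sketch below is a minimal batch implementation under that reading; the function names and the `(time, value)` tuple layout for rewards are assumptions made for this example and do not come from the original implementation.

```python
def attempt_utility(tau_j, rewards, gamma):
    """Eq. 25: discounted sum of point rewards that arrive at or after tau_j.

    rewards is a list of (t_i, r_i) pairs (hypothetical layout for this sketch).
    """
    return sum(r * gamma ** (t - tau_j) for t, r in rewards if t - tau_j >= 0)


def q_sample_average(taus, rewards, gamma):
    """Eq. 26: equally weighted average of the per-attempt utilities."""
    utilities = [attempt_utility(tau, rewards, gamma) for tau in taus]
    return sum(utilities) / len(utilities)


def q_exp_weighted(taus, rewards, gamma, alpha, now):
    """Eq. 27: average weighted by alpha ** (now - tau_j), favoring recent attempts."""
    weights = [alpha ** (now - tau) for tau in taus]
    utilities = [attempt_utility(tau, rewards, gamma) for tau in taus]
    return sum(w * u for w, u in zip(weights, utilities)) / sum(weights)


# Two attempts (at times 0 and 2) and two point rewards (at times 1 and 3).
taus = [0, 2]
rewards = [(1, 1.0), (3, 2.0)]
print(q_sample_average(taus, rewards, gamma=0.5))
print(q_exp_weighted(taus, rewards, gamma=0.5, alpha=0.5, now=3))
```

With a base weight below one, the exponentially weighted form gives the more recent attempt the larger share of the estimate, which is the property the text relies on for changing environments.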

Both versions can be updated upon arrival of point rewards or in a continuous manner.
We will first discuss the recursive update to $U_j(s, a)$ upon the arrival of the $(k+1)$th
point reward. First define the utility of the $j$th attempt of a state-action pair, (s, a), as a
function of the $k$ point rewards received, $U_j(k) = U_j(s, a)$. The expected utility
of the $j$th attempt of a state-action pair, $U_j(s, a)$, only changes with the arrival of point
rewards, so in order to update our estimate of the value of each attempt we need to compute
$U_j(k+1)$,

$$U_j(k+1) = U_j(k) + r_{k+1} \gamma^{t_{k+1} - \tau_j} H(t_{k+1} - \tau_j). \qquad (28)$$

In order to update the overall estimate of the value of a state-action pair, Q(s, a),
we first define it as a function of the number of attempts of each state-action pair, $n$, and
the number of point rewards, $k$, received, $Q(n, k) = Q(s, a)$.

We can think of the update needing to occur in two cases. In the first case, we
have received no additional point rewards, but have increased our number of attempts of a
state-action pair by $x$, $Q(n + x, k)$. The $x$ new attempts have not yet been credited with
any reward, so they enter the average with zero utility,

$$Q(n + x, k) = \frac{n}{n + x} Q(n, k). \qquad (29)$$

In the second case, we have received an additional point reward and have made additional
attempts of a state-action pair, $Q(n + x, k + 1)$,

$$Q(n + x, k + 1) = \frac{n}{n + x} Q(n, k) + \frac{1}{n + x} \sum_{j=1}^{n+x} r_{k+1} \gamma^{t_{k+1} - \tau_j} H(t_{k+1} - \tau_j). \qquad (30)$$

Note that the first term is identical to the update for $Q(n + x, k)$ and that in
the second term we simply update the value of all attempts, $j = 1 \ldots n + x$, of
a state-action pair with the $(k+1)$th reward, making sure that we weight this incremental
update equally with the previous ones by multiplying by $\frac{1}{n + x}$. Since all previous
attempts receive credit for new rewards, we must update all $j$ visits rather than just the
$n \ldots n + x$ attempts. This produces a straightforward and direct estimate that is equally
weighted and takes into account environment time in the update of the expected utility of
each visit, but does not adjust the weights on each visit based on environment time, see
Algorithm 5. In Algorithm 5 we choose to conduct updates only on the receipt of point
rewards; however, the algorithm could easily be adjusted to make updates following each
attempt of a state-action pair with no receipt of a point reward. The choice depends on the
needs of the application and computational considerations. We can make this estimate more
sensitive to changes in the environment by developing the second key idea further and
incorporating an exponentially weighted average.

Algorithm 5 Direct-Q Computation (DQ-C) with sample averaging

 1: Parameters: discount rate, $\gamma$; base weight, $\alpha$.
 2: Indices: $i$, index of point rewards, $i \in \{1 \ldots k\}$; $j$, index of attempts of a state-action pair, $j \in \{1 \ldots n\}$.
 3: Terms: number of attempts of a state-action pair, $n$; number of attempts of a state-action pair since last update, $x$; number of
    point rewards received, $k$; time of the $j$th attempt of a state-action pair, $\tau_j$; value of the $i$th point reward, $r$.
 4: Initialize $\pi$ (e.g., ε-greedy, Boltzmann).
 5: Initialize for all state-action pairs as they are encountered: Q(s,a)=0, estimate of each state-action pair; n(s,a)=0, counter for
    number of attempts of a state-action pair; x(s,a)=0, counter for number of attempts of a state-action pair since last update.
 6: while agent is running do
 7:     $t \leftarrow$ current time
 8:     Return $a$ using $\pi(s)$.
 9:     $\tau_j \leftarrow t$
10:     Append $\tau_j$ to the list of attempt times for (s, a).
11:     $x(s, a) \leftarrow x(s, a) + 1$
12:     Observe for point reward, $r$.
13:     if point reward $r$ is observed then
14:         for all state-action pairs observed, update Q(s,a) do
15:             if n(s,a) not 0 then
16:                 $x \leftarrow x(s, a)$
17:                 $n \leftarrow n(s, a)$
18:                 $Q(s, a) \leftarrow \frac{n}{n + x} Q(s, a)$
19:                 $n \leftarrow n + x$
20:                 $n(s, a) \leftarrow n$
21:                 $x(s, a) \leftarrow 0$
22:                 for $j$ from 1 to $n$ do
23:                     $Q(s, a) \leftarrow Q(s, a) + \frac{1}{n} r \gamma^{t - \tau_j}$
24:                 end for
25:             end if
26:         end for
27:     end if
28: end while
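The incremental update in Algorithm 5 can be illustrated for a single state-action pair with a short sketch. This is not the original implementation: the class and method names are hypothetical, and the sketch makes one labeled assumption that differs from a literal reading of line 15, namely that pending attempts are folded in whenever at least one attempt exists, so that the very first reward is credited.

```python
class SampleAverageDQC:
    """Sample-average DQ-C estimate for one state-action pair (Eqs. 29 and 30)."""

    def __init__(self, gamma):
        self.gamma = gamma
        self.q = 0.0             # current estimate Q(s, a)
        self.n = 0               # attempts already folded into q
        self.pending = 0         # attempts since the last update (x in the text)
        self.attempt_times = []  # tau_j for every attempt so far

    def record_attempt(self, t):
        """The pair (s, a) was selected at environment time t."""
        self.attempt_times.append(t)
        self.pending += 1

    def record_reward(self, t, r):
        """A point reward r arrived at environment time t."""
        total = self.n + self.pending
        if total == 0:
            return  # assumption: a pair that was never attempted gets no credit
        # Eq. 29: rescale the old average to account for the new attempts.
        self.q *= self.n / total
        self.n, self.pending = total, 0
        # Eq. 30: every attempt so far is credited with the discounted reward.
        # All stored attempt times are <= t here, so the indicator H equals 1.
        for tau in self.attempt_times:
            self.q += r * self.gamma ** (t - tau) / self.n


pair = SampleAverageDQC(gamma=0.5)
pair.record_attempt(0)
pair.record_reward(1, 1.0)
pair.record_attempt(2)
pair.record_reward(3, 2.0)
print(pair.q)
```

Storing every attempt time keeps the sketch close to the listing; the cost of each reward update grows with the number of attempts, which is the computational consideration the text mentions.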


Considering the update of the exponentially weighted version, we see that there is
no change to the update of the expected utility, $U_j(k)$, of the $j$th visit to a state-action
pair from the previous update. In order to update the overall estimate of the value of a
state-action pair, we first define it as a function of the number of attempts of each
state-action pair, $n$, the number of point rewards, $k$, and the current time, $t$,
$Q(n, k, t) = Q(s, a)$. We choose to leave the current time in this version of the formulation
since it allows us to avoid potential numerical precision errors, and we provide an
alternative without it subsequently. In this case, since the weights applied to each $j$th
visit are dependent on the current time, $t$, we must update the value of each set of weights
upon each update, see Algorithm 6. Here we conduct an update to the numerator for each
state-action pair as a function of the number of attempts of a state-action pair, the number
of point rewards, and the current time, $N(n, k, t) = N(s, a)$, and the denominator for each
state-action pair as a function of the number of attempts of each state-action pair and the
current time, $D(n, t) = D(s, a)$. Let,


$$N(n, k, t) = \sum_{j=1}^{n} \alpha^{t - \tau_j} U_j(k),$$

$$D(n, t) = \sum_{j=1}^{n} \alpha^{t - \tau_j},$$

$$Q(n, k, t) = \frac{N(n, k, t)}{D(n, t)}. \qquad (31)$$


Rearranging terms, we isolate our time dependent parameter in both the numerator
and denominator,

$$N(n, k, t) = \alpha^{t} \sum_{j=1}^{n} \frac{1}{\alpha^{\tau_j}} U_j(k),$$

$$N(n, k, t) = \alpha^{t} S(n, k), \qquad (32)$$

where,

$$S(n, k) = \sum_{j=1}^{n} \frac{1}{\alpha^{\tau_j}} U_j(k), \qquad (33)$$

and,

$$D(n, t) = \alpha^{t} \sum_{j=1}^{n} \frac{1}{\alpha^{\tau_j}},$$

$$D(n, t) = \alpha^{t} C(n), \qquad (34)$$

where,

$$C(n) = \sum_{j=1}^{n} \frac{1}{\alpha^{\tau_j}}. \qquad (35)$$

Now we can state the update for $N(n, k, t)$ as,

$$N(n + x, k + 1, t) = \alpha^{t} S(n + x, k + 1),$$

$$S(n + x, k + 1) = S(n, k) + \sum_{j=1}^{n+x} \frac{1}{\alpha^{\tau_j}} r_{k+1} \gamma^{t_{k+1} - \tau_j} H(t_{k+1} - \tau_j), \qquad (36)$$

and the update for $D(n, t)$ as,

$$D(n + x, t) = \alpha^{t} C(n + x),$$

$$C(n + x) = C(n) + \sum_{j=n+1}^{n+x} \frac{1}{\alpha^{\tau_j}}. \qquad (37)$$


DQ-C uses all available data, and the exponential weighting allows it to be used in an
on-line manner. The exponential weighting also allows DQ-C to be
sensitive to changes in the environment, while avoiding the use of TD methods with their
inherent assumption regarding the Markov property, which seldom holds in application
environments. In Algorithm 6 we make use of the above formulation. See Appendix D for
a brief theoretical analysis.

Algorithm 6 Direct-Q Computation (DQ-C) with exponential weighting

1: Parameters: discount rate, $\gamma$; base weight, $\alpha$.
2: Indices: $i$, index of point rewards, $i \in \{1 \ldots k\}$; $j$, index of attempts of a state-action pair, $j \in \{1 \ldots n(s, a)\}$.
3: Terms: number of attempts of a state-action pair, $n$; number of point rewards received, $k$; arrival time of the $i$th point reward, $t_i$;
   time of the $j$th attempt of a state-action pair, $\tau_j$; value of the $i$th point reward, $r_i$.
4: Initialize $\pi$ (e.g., ε-greedy, Boltzmann).
5: Initialize for all state-action pairs as they are encountered: Q(s,a)=0, estimate of each state-action pair; N(s,a)=0, numerator used
   in estimate, and the current time; D(s,a)=0, the denominator of the estimate.
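A minimal sketch of the exponentially weighted update for a single state-action pair is given below. It is an illustration rather than the original implementation: it keeps only the time-independent accumulators (the sums called S and C in the derivation) and forms the estimate as their ratio, since the common factor of the base weight raised to the current time cancels between numerator and denominator. The class and method names are hypothetical, and a base weight strictly greater than zero is assumed.

```python
class ExpWeightedDQC:
    """Exponentially weighted DQ-C estimate for one state-action pair."""

    def __init__(self, gamma, alpha):
        self.gamma = gamma
        self.alpha = alpha       # assumed to satisfy 0 < alpha <= 1
        self.s = 0.0             # running sum of alpha**(-tau_j) * U_j
        self.c = 0.0             # running sum of alpha**(-tau_j)
        self.attempt_times = []  # tau_j for every attempt so far

    def record_attempt(self, t):
        """The pair was selected at time t: extend the denominator sum."""
        self.attempt_times.append(t)
        self.c += self.alpha ** (-t)

    def record_reward(self, t, r):
        """A point reward r arrived at time t: credit every earlier attempt."""
        # All stored attempt times are <= t, so the indicator H equals 1.
        for tau in self.attempt_times:
            self.s += self.alpha ** (-tau) * r * self.gamma ** (t - tau)

    def q(self):
        """Current estimate; the factor alpha**t cancels in the ratio."""
        return self.s / self.c if self.c else 0.0


pair = ExpWeightedDQC(gamma=0.5, alpha=0.5)
pair.record_attempt(0)
pair.record_reward(1, 1.0)
pair.record_attempt(2)
pair.record_reward(3, 2.0)
print(pair.q())
```

One caveat of this time-free form is that the inverse powers of the base weight grow quickly as attempt times increase, so a long-running agent would need rescaling or the time-dependent form to stay within floating-point range.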



In the next section we will discuss the results of empirical benchmarking with DQ-C.

B. BENCHMARK PROBLEMS 

We compare DQ-C to Q-learning, SARSA, SARSA(λ), and Q(λ) using ε-greedy
and Boltzmann policies on the n-arm bandit task as a benchmark for learning speed in a
simple single state environment. We next compare the performance of DQ-C to Q-learning,
SARSA, SARSA(λ), and Q(λ) in gridworlds ranging in size from 2x2 to 10x10 using
ε-greedy and Boltzmann policies across a range of input parameters and environmental
conditions. In each case parameters were tuned for each algorithm, and the results reported
reflect the best performance of each algorithm in the given task environment.
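For readers unfamiliar with the two exploration policies used throughout these benchmarks, the following sketch shows one common way to implement them over a list of action-value estimates. The function names and the `temperature` parameter are assumptions for this example; in particular, the text does not state whether its Boltzmann parameter is a temperature or an inverse temperature.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


def boltzmann_probabilities(q_values, temperature):
    """Softmax over action values; subtracting the max avoids overflow in exp."""
    top = max(q_values)
    exps = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def boltzmann(q_values, temperature, rng=random):
    """Sample an action index in proportion to its Boltzmann probability."""
    probs = boltzmann_probabilities(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


q = [0.1, 0.9, 0.3]
print(epsilon_greedy(q, epsilon=0.1))
print(boltzmann_probabilities(q, temperature=0.5))
```

Under this parameterization a high temperature makes the Boltzmann policy nearly uniform, while a low temperature makes it nearly greedy.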

1. N-Arm Bandit 

Bandit problems can be described as single state MDPs $(s_0, a_n \in A, R(s_0, a_n))$
(Robbins, 1952). The problem is to determine the optimal policy to maximize long-term
utility, $\pi^*(s_0)$. Performance in bandit problems can be measured using regret, the cost of
not choosing the best arm,

$$\rho = T\mu^* - \sum_{t=1}^{T} \mu_t, \qquad (38)$$

accumulated over a finite-time horizon, $T$, where $\mu_t$ is the expected return of the arm
pulled at trial $t$ and $\mu^*$ is the expected return of the best arm (Kocsis & Szepesvari,
2006). We treat this as a continuous learning task for 250 trials, where each trial is an
attempt or pull of an arm. At the end of each trial the total accumulated regret is measured,
and the mean and standard error of the sample of 500 replications are reported. Per Sutton's
description of the 10-arm bandit benchmark, we examine a case with zero noise on the reward,
$\sigma^2 = 0$, and a case with Gaussian noise, $\sigma^2 = 1$.
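As a small illustration of the regret measure, the sketch below computes total regret from the true arm means and the sequence of arms pulled, reading regret as the horizon multiplied by the best arm's expected return minus the summed expected returns of the arms actually chosen. The function name and argument layout are assumptions for this example.

```python
def total_regret(arm_means, chosen_arms):
    """Total regret after len(chosen_arms) pulls, given the true arm means."""
    best = max(arm_means)
    return len(chosen_arms) * best - sum(arm_means[a] for a in chosen_arms)


# Two arms with expected returns 0.2 and 1.0; the weaker arm is pulled twice.
print(total_regret([0.2, 1.0], [0, 1, 1, 0]))
```

Regret computed this way depends only on which arms were pulled, not on the noisy rewards observed, so it isolates the quality of the policy's choices.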

In the 10-arm benchmark task with zero noise on the reward, $\sigma^2 = 0$, SARSA
paired with Boltzmann, β, demonstrated the lowest total regret, see Figure 11, followed
closely by Q-learning paired with Boltzmann and by DQ-C paired with ε-greedy, see Table 1.
The results for the 10-arm case with noise on the reward, $\sigma^2 = 1$, are similar, with
DQ-C paired with β leading, followed by Q-learning and SARSA, both paired with ε-greedy,
see Table 2. DQ-C performs well in the most challenging of the bandit environments explored,
the 10-arm benchmark with noise, accumulating 23% less regret than Q-learning, its nearest
competitor.

Figure 11: Top three performers for 10-arm bandit benchmark domain, for $\sigma^2 = 1.0$ and
$\sigma^2 = 0.0$, 250 trials and 500 replications. Mean total regret is plotted for each
algorithm policy pair along with associated standard error.

Table 1: Regret results for algorithms coupled with ε-greedy and Boltzmann (β) for the 10-arm
bandit benchmark task following 250 trials with 500 replications with $\sigma^2 = 0$ on the
reward signal. SARSA with β provided the best result.

Algorithm      ε-greedy         β
Q-learning     69.52 ± 1.23     22.89 ± 0.66
SARSA          71.17 ± 1.15     22.43 ± 0.65
Q(λ)           44.4 ± 1.46      87.08 ± 1.59
SARSA(λ)       55.23 ± 1.86     77.38 ± 2.22
DQ-C           25.21 ± 0.97     55.85 ± 1.0


Table 2: Regret results for algorithms coupled with ε-greedy and Boltzmann (β) for the 10-arm
bandit benchmark task following 250 trials with 500 replications with $\sigma^2 = 1$ on the
reward signal. DQ-C with β provided the best result.

Algorithm      ε-greedy         β
Q-learning     46.56 ± 1.35     67.45 ± 1.56
SARSA          46.77 ± 1.35     63.94 ± 1.22
Q(λ)           65.45 ± 1.71     53.44 ± 1.88
SARSA(λ)       60.8 ± 1.7       53.07 ± 1.8
DQ-C           47.01 ± 1.34     35.99 ± 1.27
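As a check on the 23% figure quoted for the noisy 10-arm case, the relative reduction in regret can be worked from the Table 2 entries, taking DQ-C with Boltzmann (35.99) and Q-learning with ε-greedy (46.56) as the two values being compared; that pairing is an assumption about which entries the text refers to.

```latex
\frac{46.56 - 35.99}{46.56} = \frac{10.57}{46.56} \approx 0.227
```

This is consistent with the reported reduction of roughly 23%.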


2. Two-Arm Bandit 

In order to further understand the performance of the algorithms in the simplest
cases, we also used the same procedures to examine the 2-arm bandit case. In this set of
experiments we also explored the impact of the absolute difference between the means of
the two arms. When no noise is present, DQ-C produced the best performance when the
arms were well separated and was slightly outperformed by Q-learning and SARSA in the
case where the arms were only 0.2 apart, see Table 3. Since Q-learning systematically
overestimates the best arm, this strategy pays off when the true values of the arms are close
(Thrun & Schwartz, 1993). When there is noise on the arms, Q(λ) produces
lower regret than DQ-C in the case where the means of the arms are well separated, but in
the most challenging case, low separation on the arms with noise, DQ-C produces the best
result, see Table 4.

Table 3: Regret results for algorithms coupled with ε-greedy and Boltzmann (β) for the 2-arm
bandit following 250 trials with 500 replications with $\sigma^2 = 0$,
$\delta = \mu^* - \mu_i = 0.8$ and 0.2. DQ-C with β provides the best results in the
$\delta = 0.8$ case, Q-learning and SARSA paired with ε-greedy in the $\delta = 0.2$ case.

δ = 0.8
Algorithm      ε-greedy         β
Q-learning     14.59 ± 0.44     30.83 ± 0.64
SARSA          14.59 ± 0.44     19.81 ± 0.97
Q(λ)           65.45 ± 1.71     4.96 ± 0.26
SARSA(λ)       73.68 ± 1.44     6.36 ± 0.48
DQ-C           17.73 ± 0.57     1.74 ± 0.08

δ = 0.2
Algorithm      ε-greedy         β
Q-learning     3.64 ± 0.11      21.66 ± 0.20
SARSA          3.64 ± 0.11      21.53 ± 0.21
Q(λ)           18.87 ± 0.40     5.74 ± 0.27
SARSA(λ)       13.97 ± 0.12     15.15 ± 0.62
DQ-C           4.52 ± 0.14      5.25 ± 0.26

Table 4: Regret results for algorithms coupled with ε-greedy and Boltzmann (β) for the 2-arm
bandit following 250 trials with 500 replications with $\sigma^2 = 1$,
$\delta = \mu^* - \mu_i = 0.8$ and 0.2. Q(λ) paired with β resulted in the best performance
for $\delta = 0.8$ and DQ-C paired with β for $\delta = 0.2$.

δ = 0.8
Algorithm      ε-greedy         β
Q-learning     14.49 ± 0.52     20.95 ± 0.24
SARSA          14.53 ± 0.52     19.1 ± 0.32
Q(λ)           19.05 ± 0.58     9.28 ± 0.41
SARSA(λ)       18.48 ± 0.75     14.79 ± 0.80
DQ-C           14.28 ± 0.52     12.33 ± 0.65

δ = 0.2
Algorithm      ε-greedy         β
Q-learning     14.81 ± 0.41     31.96 ± 0.75
SARSA          14.79 ± 0.40     19.53 ± 0.85
Q(λ)           35.47 ± 1.38     15.13 ± 1.18
SARSA(λ)       21.71 ± 1.13     11.57 ± 0.89
DQ-C           14.68 ± 0.42     4.62 ± 0.37


3. Gridworld 

Gridworld provides a sequential MDP, a stochastic shortest path problem that can
be described by the tuple $(s \in S, a_n \in A, P^{a}_{s,s'}, R(s_0, a_n))$, see Figure 12.

Figure 12: Sample gridworld domain.


We initially explore performance in deterministic and stochastic gridworlds varying
in size from 2x2 to 10x10. In the stochastic version of this task, we use a simple model
based on the gridworld from Russell and Norvig, but maintain only a single positive reward
upon attainment of the goal state (Russell & Norvig, 2010). The transition matrix for each
state is $P^{a}_{s,s'} = 0.8$ and $P^{a}_{s,s} = 0.2$. We measure performance by the mean
number of times the goal state was attained over 1000 replications of each condition. In each
case the trial period was limited to 500 time steps, meaning the agent sought to maximize the
number of times the goal was attained during that period. In both the deterministic case and
the stochastic case, DQ-C paired with either ε-greedy or Boltzmann provided a greater overall
utility, over two times the nearest competitor, than the comparison cases for the same time
periods, indicating that DQ-C learned faster in this benchmark domain, see Figures 13 and
14 as well as Tables 5 and 6.

Figure 13: Top three performers for deterministic and stochastic 5x5 gridworld task, 500
trials and 1000 replications. Mean total utility is plotted for each algorithm policy
pair along with associated standard error.

Figure 14: Top three performers for deterministic and stochastic 10x10 gridworld task, 500
trials and 1000 replications. Mean total utility is plotted for each algorithm policy
pair along with associated standard error.
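To make the stochastic gridworld transition concrete, the sketch below implements one plausible reading of the transition model in this section: with probability 0.8 the agent moves to the intended neighboring cell and otherwise it stays where it is. That interpretation, the coordinate convention, the action names, and the function name are all assumptions for this example and are not taken from the original code.

```python
import random

# Hypothetical action set: unit moves on a grid indexed as (x, y).
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}


def step(position, action, size, p_success=0.8, rng=random):
    """Return the next cell; the move succeeds with probability p_success."""
    x, y = position
    if rng.random() < p_success:
        dx, dy = MOVES[action]
        # Clamp to the grid so that moving into a wall leaves the agent in place.
        x = min(max(x + dx, 0), size - 1)
        y = min(max(y + dy, 0), size - 1)
    return (x, y)


print(step((0, 0), "right", size=5))
```

Setting `p_success=1.0` recovers the deterministic gridworld used in the same experiments.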










Table 5: Mean goals achieved for deterministic (top) and stochastic (bottom) 5x5 gridworld
following 500 trials with 1000 replications with single reward upon attainment of the goal
state by algorithm and policy. DQ-C with ε-greedy obtained the best result in each case.

Deterministic
Algorithm      ε-greedy         β
Q-learning     3.84 ± 0.12      3.04 ± 0.08
SARSA          3.734 ± 0.12     2.93 ± 0.07
Q(λ)           2.23 ± 0.05      2.17 ± 0.55
SARSA(λ)       7.34 ± 0.22      5.97 ± 0.16
DQ-C           15.25 ± 0.33     14.45 ± 0.33

Stochastic
Algorithm      ε-greedy         β
Q-learning     2.508 ± 0.08     2.24 ± 0.06
SARSA          2.49 ± 0.08      2.24 ± 0.06
Q(λ)           1.74 ± 0.05      1.77 ± 0.05
SARSA(λ)       4.53 ± 0.14      4.28 ± 0.14
DQ-C           10.30 ± 0.28     9.34 ± 0.27

Table 6: Mean goals achieved for deterministic (top) and stochastic (bottom) 10x10 gridworld
following 500 trials with 500 replications with single reward upon attainment of the
goal state by algorithm and policy. DQ-C with ε-greedy obtained the best result in each
case.

Deterministic
Algorithm      ε-greedy         β
Q-learning     0.504 ± 0.03     0.586 ± 0.02
SARSA          0.504 ± 0.02     0.576 ± 0.02
Q(λ)           0.484 ± 0.02     0.544 ± 0.02
SARSA(λ)       1.67 ± 0.12      0.668 ± 0.03
DQ-C           3.03 ± 0.21      1.12 ± 0.07

Stochastic
Algorithm      ε-greedy         β
Q-learning     0.47 ± 0.02      0.39 ± 0.02
SARSA          0.47 ± 0.02      0.394 ± 0.02
Q(λ)           0.45 ± 0.02      0.384 ± 0.02
SARSA(λ)       1.07 ± 0.07      0.432 ± 0.02
DQ-C           2.39 ± 0.16      0.524 ± 0.03


In order to gain further insight into the performance of DQ-C under more challenging
conditions, we repeated the procedures described above on a non-stationary stochastic
version of gridworld, where the likelihood of transitioning from a state oscillated according
to a sinusoidal function, and on a version of gridworld with a dynamic goal. In both cases
DQ-C outperformed alternative methods, with DQ-C paired with both exploration techniques
finishing in the top two slots in each case, followed by SARSA(λ), see Figure 15.

Figure 15: Top three performers for 5x5 gridworld, with oscillating transition matrix and
a dynamic goal, 250 trials and 500 replications. Mean utility is plotted for each algorithm
policy pair along with associated standard error.

The results from the dynamic gridworld illustrate the success of the algorithm in
non-stationary environments.

4. Summary 

Our estimator results in faster learning on the most challenging empirical cases that
we have explored, those with noisy and delayed rewards, with noticeable improvements
relative to dominant algorithms under two of the more challenging conditions. Future work
will compare the performance of DQ-C to model-based methods, conduct further analysis
of its theoretical properties, and document its application in applied domains. See
Appendix I for additional benchmarking results.

In the next sections, we will examine how the results we observed in these benchmark
domains carry over to more challenging applied problems in game domains. The
first section details the use of RL in a game-based domain, the Physical Traveling Salesman
Problem. The second section provides results of the incorporation of an RL controller into
the classic arcade game, Pacman.

C. PHYSICAL TRAVELING SALESMAN PROBLEM 

The traveling salesman problem (TSP) is one of the most widely studied optimization
problems, with many approaches involving heuristics and meta-heuristics for finding
approximately optimal solutions for problems of very large scale involving many thousands
of cities. The objective of a standard TSP is to minimize the total distance travelled.
The physical traveling salesman problem (PTSP) makes a modification to the base problem
that makes the solution to even relatively small problems much more complex. In this
modified version the salesman has mass and moves by applying a force vector to the mass
at each point in time and controlling the orientation of the mass through direction controls.
The objective is to find the shortest path through the cities, measured by the total time
required to complete the circuit, while minimizing the number of force vectors applied.
The environment is a two-dimensional board with ten waypoints and multiple obstacles,
see Figure 16. This problem is currently part of an open competition involving two
IEEE-sponsored conferences: the 2012 IEEE World Congress on Computational Intelligence and
the 2012 IEEE Conference on Computational Intelligence and Games
(http://www.ptsp-game.net).

Figure 16: Sample map for physical traveling salesman problem.


1. Problem Specification 

The PTSP requires that an object with mass, referred to as a ship, moving in a
two-dimensional environment with physics-based laws of motion travel through a set of
waypoints while minimizing time, with a lesser weight on the minimization of thrust. The
controller must be initialized within 1000 ms and at each iteration must respond with a
control input within 40 ms. We can describe the environment by defining the following
indices,


• $t$, time, $t = \{1 \ldots T\}$, where $T$ is the maximum time allowed to reach a waypoint,
constant at $T = 1000$ steps,

• $i$, waypoint, $i = \{1 \ldots |W|\}$, where $W$ is the set of waypoints that the ship must visit,

• $j$, action index, $j = \{0, 1, 2, 3, 4, 5\}$,

and the following additional terms,

• $p_t$, the ship's position vector in two dimensional space at time $t$,

• $o_t$, the ship's orientation at time $t$,

• $v_t$, the ship's velocity at time $t$,

• $\alpha$, the rotation step-size applied to the ship's orientation upon a control action, a
constant expressed in radians,

• $L$, the friction factor applied to the ship's velocity at each $t$, constant at $L = 0.99$,

• $K$, the collision factor modifying the direction of the ship upon collision with an
object, constant at $K = 0.25$,

• $T_{initialize}$, the maximum time allowed for a controller to initialize, constant at
$T_{initialize} = 1000$ ms,

• $T_{response}$, the maximum time allowed for a controller to respond at each $t$, constant
at $T_{response} = 40$ ms,





which are used in the following update equations,

$$p_{t+1} = p_t + v_{t+1},$$

$$o_{t+1} = \begin{pmatrix} \cos(\alpha) & -\sin(\alpha) \\ \sin(\alpha) & \cos(\alpha) \end{pmatrix} o_t,$$

$$v_{t+1} = (v_t + (o_{t+1} T_t K)) L. \qquad (39)$$
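The update equations can be read as a single simulation step: rotate the orientation, add thrust along the new orientation, apply friction, then move. The sketch below follows that reading. The encoding of the control input as a steering value in {-1, 0, 1} and a thrust value in {0, 1} is an assumption for this example, as are the function name and the default constants, which simply reuse the values of $L$ and $K$ quoted in the list of terms; the rotation step is left as a required argument because its value is not recoverable from the text.

```python
import math


def update_ship(position, velocity, orientation, steer, thrust,
                alpha, friction=0.99, k=0.25):
    """One step of the ship dynamics (Eq. 39) for 2-D tuples.

    steer: -1, 0 or 1 rotation steps (assumed encoding).
    thrust: 0 or 1 (assumed encoding of the thrust input T_t).
    """
    angle = steer * alpha
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    ox, oy = orientation
    # Rotate the orientation vector by the signed rotation step.
    new_o = (cos_a * ox - sin_a * oy, sin_a * ox + cos_a * oy)
    # Add thrust along the new orientation, then apply friction.
    vx = (velocity[0] + new_o[0] * thrust * k) * friction
    vy = (velocity[1] + new_o[1] * thrust * k) * friction
    # Position advances by the updated velocity.
    new_p = (position[0] + vx, position[1] + vy)
    return new_p, (vx, vy), new_o


p, v, o = update_ship((0.0, 0.0), (0.0, 0.0), (1.0, 0.0),
                      steer=0, thrust=1, alpha=0.1)
print(p, v, o)
```

Because friction multiplies the velocity at every step, a ship that stops thrusting coasts and slows gradually, which is what makes overshooting a waypoint possible in this domain.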


The set of actions, a G A, controls steering and the application of thrust to the 
ship and all actions act as forces applied to the ship to update its state using the update 
equations. Since our focus is RL we decompose the PTSP into a path planning problem, 
where the goal is to select the next waypoint in an efficient manner, and a ship control 
problem, where the goal is to learn how to use the six actions available to navigate the ship 
to its destination. 

In order to determine which waypoint to navigate to next, we determine the closest
waypoint from our current location that has not yet been visited, see Algorithm 7. If we
have line of sight to the waypoint then the RL controller must control the ship to reach the
waypoint, but if there is no line of sight to the next waypoint we locate the nearest location
that does have line of sight to the waypoint and create an intermediate objective. In this
case the RL controller first attempts to control the ship to reach the intermediate objective
and then re-assigns the destination to the target waypoint. Should the ship grossly overshoot
its target, which is entirely possible in this domain, it will identify the closest waypoint
and reassign the waypoint objective. This is a purely greedy strategy, but it serves its
purpose of providing a simple mechanism to select the next waypoint to facilitate the
learning of the control task.

Algorithm 7 Greedy city selection

1: Create city list.
2: Set closest unvisited city as next waypoint.
3: if no line of sight then
4:     Set intermediate waypoint.
5: end if
6: Select control actions with RL controller.
7: Increment t. Go to line 2.
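The waypoint choice in line 2 of Algorithm 7 amounts to a nearest-neighbor query over the unvisited waypoints. A minimal sketch is shown below; it deliberately omits the line-of-sight test and the intermediate waypoint, which depend on the map geometry, and the function name and data layout are assumptions for this example.

```python
import math


def next_waypoint(position, waypoints, visited):
    """Return the closest waypoint not yet visited, or None if all are visited."""
    remaining = [w for w in waypoints if w not in visited]
    if not remaining:
        return None
    return min(remaining, key=lambda w: math.dist(position, w))


waypoints = [(5.0, 5.0), (1.0, 1.0), (2.0, 0.0)]
print(next_waypoint((0.0, 0.0), waypoints, visited={(1.0, 1.0)}))
```

Re-running the query at every step, as the loop back to line 2 does, is what lets the ship retarget after a gross overshoot.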


2. RL Formulation 

We focus the RL formulation on the control subproblem of the PTSP. We define
the RL control problem by the following tuple, $(s_t \in S, a \in A, P^{a}_{s,s'}, R(s))$. The
state description vector, $s_t = (O_t, V_t, D_t)$, describes the state of the ship at time $t$,
where $O_t(o_t)$ = orientation of the ship relative to the goal, $V_t(v_t)$ = speed,
$D_t(d_t)$ = distance from the goal, and the state space is developed as described below.


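$$
O_t(o_t) = \begin{cases}
4 & o_t < -0.5 \\
3 & -0.5 < o_t \le 0 \\
2 & 0 < o_t \le 0.75 \\
1 & 0.75 < o_t \le 0.97 \\
0 & 0.97 < o_t
\end{cases}
$$

$$
V_t(v_t) = \begin{cases}
0 & 0 < v_t \le 0.4 \\
1 & 0.4 < v_t
\end{cases}
$$

$$
D_t(d_t) = \begin{cases}
0 & 0 < d_t \le 15 \\
1 & 15 < d_t
\end{cases}
$$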
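The discretization above maps three continuous quantities onto a small table of 5 x 2 x 2 = 20 states. A minimal Python sketch of that mapping follows; the function name is an assumption, and boundary values left uncovered by the definitions (for example o_t exactly equal to -0.5) are assigned to the adjacent bin here as a convenience.

```python
def discretize(o, v, d):
    """Map orientation o, speed v and distance d to the state (O_t, V_t, D_t)."""
    if o < -0.5:
        o_bin = 4
    elif o <= 0:
        o_bin = 3
    elif o <= 0.75:
        o_bin = 2
    elif o <= 0.97:
        o_bin = 1
    else:
        o_bin = 0
    v_bin = 0 if v <= 0.4 else 1   # slow versus fast
    d_bin = 0 if d <= 15 else 1    # near versus far from the goal
    return (o_bin, v_bin, d_bin)
```

Because every component is expressed relative to the current goal, the same 20 states are reused for every waypoint on every map.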


The reward function is defined, 
R(O_t, V_t, D_t)


Ot j I,! 1 > O t &d t _i 



dt > 0.4 


and provides a graduated reward structure, but with a large bonus obtained only upon
reaching the waypoint.

The challenge of developing the appropriate state space for a given problem is a 
significant research topic in its own right. 


3. Empirical Results 

In order to gain insight into the performance of the algorithm-policy pairs on the
control task, we conducted computational experiments comparing the performance of DQ-C,
SARSA, Q-Learning, SARSA(λ), and Q(λ) when paired with ε-greedy and Boltzmann (β)
exploration. We evaluated performance across a range of parameters for each algorithm and
show results for the best performance observed for each algorithm. Each combination is
replicated 1000 times, with each replication using a different initial random seed as a
variance reduction measure. Additionally, performance is measured across ten maps of
varying complexity, see Appendix E.

DQ-C matches or outperforms the comparison cases across all map sets. DQ-C
paired with Boltzmann (β) had the highest average number of waypoints found and
the lowest time per waypoint, see Figure 17 and Figure 18. Q(λ) paired with Boltzmann
followed DQ-C, and in the case of at least one map outperformed DQ-C. The RL task
is formulated as a sequential task where the agent tries to learn how to control the ship
to reach the next waypoint. As a result, even during a single trip to a single city,
there is an episodic nature to the task. Since the state is always described in terms of
the ship’s location relative to the next waypoint, the state space is reduced significantly.
DQ-C requires both γ and α to be specified and is designed to address the temporal credit
assignment problem. Interestingly, the other top performers in this game environment, Q(λ)
and SARSA(λ), both include the notion of eligibility traces, also intended to address the
credit assignment problem (Sutton & Barto, 1998).



Figure 17: Mean waypoints obtained across all maps by top three performing algorithm 
and policy pairs. 

Figure 18: Mean time per waypoint obtained across all maps by top three performing
algorithm and policy pairs.


4. Insights from the Physical Traveling Salesman Problem 

As we saw in grid world and the 10-arm bandit, DQ-C continues to match or outperform
the comparison cases across all map sets. DQ-C paired with Boltzmann had the
highest average number of waypoints found and the lowest time per waypoint.
Q(λ) paired with Boltzmann followed DQ-C. The reward structure here was graduated to
facilitate faster learning given the large state space, with the large reward attained upon
reaching each waypoint serving as a large delayed reward compared to the magnitude of the
interim rewards. This is an example of an environment that makes use of continuous time
rather than a turn-based counter.

D. PACMAN 

The Pacman arcade game places a player-controlled agent, Pacman, in a variety of
mazes with the goal of avoiding ghosts, two in this version, while consuming all dots on the
board. The player has access to two power dots, which when eaten allow the ghosts to be
temporarily consumed by Pacman. We build on the infrastructure provided by the Berkeley
Pacman projects, with our empirical work focused on the medium sized classic map, see
Figure 19, (http://www.inst.eecs.berkeley.edu/csl88/pacman/pacman.html).



Figure 19: Medium sized classic Pacman map. 

1. Problem Specification

Pacman requires the controller to select actions to avoid the ghosts while consuming 
dots. We can describe the environment by defining the following indices, 

• t, turn or time, t = {1...T}, where T is the maximum time allowed to reach a 
waypoint, constant at T=1000 steps 

• i, dots, i = {1 ..n}, where n is the number of dots that must be consumed, 

• j, action index, j = {0,1,2,3}, 

• k, ghosts, k = {0,1}, 

and the following additional terms, 

• p_t, Pacman’s position vector in two dimensional space at time t,

• g_{k,t}, the k-th ghost’s position vector in two dimensional space at time t,

• o_t, Pacman’s orientation at time t,

• c_{i,t}, an indicator describing the state of the i-th dot at time t,

• f_t, the number of dots remaining at time t.

2. RL Formulation 

We define the RL control problem by the tuple $(s \in S, a \in A, P^{a}_{s,s'}, R(s))$.
The state description vector, $s_t = (F_{d,t}, D_t)$, describes the state of Pacman at time
t, where $F(p_t, c_{i,t})_{d,t}$ = the presence of food in the adjacent locations,
$d \in (E, W, N, S)$, and $D(p_t, g_{k,t})_t$ = the minimum distance to the nearest ghost.

$$
F(p_t, c_{i,t})_{d,t} = \begin{cases}
0 & \text{if food not present} \\
1 & \text{if food present}
\end{cases}
$$

$$
D(p_t, g_{k,t})_t = \begin{cases}
0 & \mathrm{dist}(p_t, g_{k,t}) < 6.0 \\
1 & 6.0 < \mathrm{dist}(p_t, g_{k,t}) \le 15.0 \\
2 & 15.0 \le \mathrm{dist}(p_t, g_{k,t})
\end{cases}
$$

The reward function is defined,

$$
R(s_t) = f_{t-1} - f_t \tag{40}
$$


providing a graduated reward structure based on the number of dots consumed, since our
goal in this case is to clear the board of food, with a large negative penalty of -500
awarded when Pacman is killed.
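The ghost-distance feature and the reward signal can be sketched in a few lines of Python. The function names are assumptions, the treatment of distances exactly equal to 6.0 or 15.0 is a convenience choice, and adding the -500 penalty to the dot-count difference is one plausible reading of the text rather than a confirmed detail of the implementation.

```python
DEATH_PENALTY = -500  # value stated in the text


def ghost_bin(dist):
    """Discretize the distance to the nearest ghost into three bins."""
    if dist < 6.0:
        return 0
    if dist <= 15.0:
        return 1
    return 2


def reward(f_prev, f_now, killed=False):
    """Dots eaten this turn (f_{t-1} - f_t), plus the penalty if Pacman dies.

    Combining the two terms by addition is an assumption.
    """
    r = f_prev - f_now
    if killed:
        r += DEATH_PENALTY
    return r
```

With four binary food indicators and three distance bins, this state description yields at most 2^4 x 3 = 48 distinct states.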

This formulation resulted in extremely slow learning, and in order to speed learning
and improve performance we opted to include additional domain knowledge in the form
of top-level strategies based on the same state. This approach is similar to one taken
previously in attempts to incorporate RL into the Pacman environment and emphasizes an
important point regarding learning speed. In most applied cases, the controller needs to
learn policies only at certain key decision points. This is worth noting, as the same problem
is encountered in the representation of human decision makers. The three strategies were to
flee, to findFood, or to pursue a greedy strategy, see Algorithms 8, 9, and 10. These
strategies became the action space for the learner, and the task became to map states to
strategies.

Algorithm 8 Flee 

1: Identify direction that increases the distance from the nearest threat. 

2: Select direction. 

3 : Select strategy with RL controller. 

Algorithm 9 Greedy 

1: Identify adjacent direction that contains food. 

2: Select direction. 

3 : Select strategy with RL controller. 







Algorithm 10 findFood 

1: Identify direction that reduces the distance to the closest food cell. 
2: Select direction. 

3: Select strategy with RL controller. 
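Algorithms 8, 9, and 10 each reduce to choosing one of four directions, which lets the learner treat the strategy names as its action space. The sketch below illustrates this in Python on an open grid; Manhattan distance, the absence of walls, and all function and variable names are simplifying assumptions for illustration and do not reflect the Berkeley Pacman code.

```python
DIRS = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}


def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def move(pos, d):
    dx, dy = DIRS[d]
    return (pos[0] + dx, pos[1] + dy)


def flee(pos, ghosts, food):
    """Pick the direction that maximizes distance to the nearest ghost."""
    return max(DIRS, key=lambda d: min(manhattan(move(pos, d), g) for g in ghosts))


def greedy(pos, ghosts, food):
    """Pick an adjacent direction that contains food, if any."""
    for d in DIRS:
        if move(pos, d) in food:
            return d
    return None


def find_food(pos, ghosts, food):
    """Pick the direction that reduces distance to the closest food cell."""
    return min(DIRS, key=lambda d: min(manhattan(move(pos, d), f) for f in food))


# The learner chooses among these names; the chosen strategy picks the move.
STRATEGIES = {"flee": flee, "greedy": greedy, "findFood": find_food}
```

The RL controller then only needs a value for each (state, strategy) pair, for example `STRATEGIES[choice](pos, ghosts, food)` once `choice` has been selected.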


3. Empirical Results 

In order to gain insight into the performance of the algorithm-policy pairs on the
control task, we conducted computational experiments comparing the performance of DQ-C,
SARSA, Q-Learning, SARSA(λ), and Q(λ) when paired with ε-greedy and Boltzmann.
We evaluated performance across a range of parameters for each algorithm and show
results for the best performance observed for each algorithm. Each combination is allowed
to train for 1000 games, with a consistent random seed used across all combinations as a
variance reduction measure. The mean percentage of the board cleared across all games
was used as a measure of performance. SARSA(λ) paired with either ε-greedy or Boltzmann
significantly outperformed competing algorithm-policy pairs and, with the exception
of SARSA(λ), Boltzmann-paired algorithms consistently outperformed their ε-greedy
counterparts, see Figure 20.

Figure 20: Mean percent of board cleared by all algorithm and policy pairs for best
performing parameter settings.


4. Insights from PACMAN 

In this section we examined the application of RL to a more complex domain with
a challenging task, the control of Pacman in the arcade game of the same name. This
is a challenging adversarial environment with a potentially very large state space. We
observed cases in this domain where DQ-C did not strictly outperform its competitors and a
case where SARSA(λ) performed well. In all cases, learning was improved through the
incorporation of domain knowledge through the use of strategy combinations. Here we are
not asking the controller to learn atomic-level information, only the mapping between the
state of the controller and the discrete strategy choices. Note that the reward structure in
this case still possessed a large delayed reward, like that associated with the attainment of
the waypoint in the previous section, but because of the difficulty in achieving the reward,
which required complete clearance of the board, the controller in most cases never achieved
it. The adversarial nature of the environment also served to shorten learning periods,
requiring multiple restarts. This aspect of the environment is similar in nature to challenges
that would be present in the use of RL in a combat model, where attrition due to combat
would create new states that would require additional learning time.

E. INSIGHTS ON DQ-C 

DQ-C showed significantly better performance on standard benchmark problems
and game-based domains than dominant algorithms that rely on noisy hill-climbing
processes. We make use of all available information by using the long-term historical reward
to determine the value of the state-action pair at the time it was chosen and then ensure that
we are weighting the most recent attempts of a state-action pair more heavily than early
attempts through the use of an exponentially weighted average tied to continuous system
time rather than simple counters. The use of discounted sums to develop an estimate of
the long-term value of each attempt of a state-action pair linked to continuous environment
time is novel, as is the use of an exponentially weighted average tied to environment time.
Weights are constantly adjusted with each attempt of a state-action pair. We avoid temporal
differencing and reduce the number of tunable parameters required to incorporate the
notion of a recency bias from 3 to 2. In the next chapter we discuss the application of DQ-C
to three applied DoD simulation use-cases.
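To make the two ingredients named above concrete, the following Python sketch computes a discounted sum for each attempt of a state-action pair and then averages those sums with weights that decay in elapsed environment time. It is only an illustration of the general idea: the exponential form exp(-alpha * elapsed), the function names, and the way γ and α enter are assumptions, and this should not be read as the DQ-C update itself, which is defined earlier in the dissertation.

```python
import math


def discounted_return(rewards, times, t0, gamma):
    """Discounted sum of rewards observed at `times` after an attempt at t0."""
    return sum(r * gamma ** (t - t0) for r, t in zip(rewards, times))


def time_weighted_average(returns, attempt_times, now, alpha):
    """Average per-attempt returns, weighting recent attempts more heavily.

    The weight exp(-alpha * (now - t)) is an assumed decay form.
    """
    weights = [math.exp(-alpha * (now - t)) for t in attempt_times]
    return sum(w * g for w, g in zip(weights, returns)) / sum(weights)
```

With alpha set to zero all attempts count equally, which recovers a plain sample average; larger values shift the estimate toward the most recent attempts.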
IV. APPLICATIONS


This chapter details the incorporation of RL in a variety of applications, with a
focus on comparison of Direct-Q Computation with similar algorithms. In the subsequent
sections we provide example applications of the algorithm in two game-based domains
and three DoD simulation models, each related to a different practical use case. In the
first section, RL is used to address two cases not currently covered by the Assignment
Scheduling Capability for Unmanned Aerial Vehicles (ASC-U) tool. The second section
shows a proof of principle integration of RL into COMBATXXI as a mechanism to control
entity-level decision making. The final section provides the results of the application of RL
to verify the reward structure of a training simulation, UrbanSim.

A. UNMANNED AERIAL VEHICLE ASSIGNMENT AND SCHEDULING PROBLEM

Dynamic programming provides an appealing alternative to solve many applied
military planning problems, such as the unmanned vehicle assignment and scheduling
problem, that do not lend themselves to pure mathematical programming techniques. This
problem includes dynamic UAV assignment constrained by the location of launch and
recovery sites (LRS), and mobile ground control stations (GCS) with finite control capacity
from which UAVs must be controlled throughout flight operations. Value in this problem
is maximized by the assignment of UAV packages to mission areas with mission
requirements that match package types. The Assignment Scheduling Capability for Unmanned
Aerial Vehicles (ASC-U) simulation model provides an approximate solution to this
problem using concepts from discrete event simulation and optimization to implement a
dynamic programming solution for tactical-level scenarios (Ahner et al., 2006; A. Buss,
2009).

1. Assignment Scheduling Capability for UAVs

ASC-U uses a discrete event simulation model coupled with the optimization of a
linear objective function to develop a feasible schedule for UAV mission assignments to
mission areas. Value is obtained from the correct pairing of mission areas with UAV package
types. UAV packages possess capabilities required to obtain value from a given mission
area assignment based on the demands associated with that mission area. ASC-U provides
a feasible solution at set intervals to a finite horizon assignment problem given complete
information concerning the mission demands and associated values within the finite-time
horizon. The simulation calculates an overall mission value for each assignment by
considering the required flight time to each mission area and the amount of time each UAV
covers the area. Note that all mission areas and their associated values are fully observable
by the scheduler at the planning point for the entire time period being scheduled (Ahner
et al., 2006). The scheduler provides each UAV only the initial mission area assignment,
ensuring that only one UAV is assigned per mission area, and relies on a greedy heuristic
to allow each UAV platform to select a subsequent mission should the mission area it
is currently covering close. The need exists to improve the ability of the scheduler agent
to develop near-optimal schedules in cases with delayed and noisy rewards.

2. Problem Specification 

The full dynamic programming formulation presented challenges due to the large
state and control space. To overcome these issues, the problem was reformulated as a
linear optimization problem that is solved at set time intervals during the course of a
discrete event simulation, see Appendix F.

3. RL Formulation and Empirical Results 

In this section, we describe two approaches to the application of RL to address the
UAV assignment and scheduling problem. The formulations differ in the manner in which the
decision problem is formulated and in the application of RL to the problem. We will refer
to the existing ASC-U framework as the base case.

a. Case 1 Formulation


In the first case, we simply replace the greedy heuristic used to reassign UAVs
following their initial allocation with an RL formulation. As each platform completes its
current mission it is presented with a set of open feasible mission areas to choose from and
in the absence of options will return to its LRS. In the base case, the platform is provided
complete and accurate information regarding the value of each alternative and it chooses
the mission with the maximum value at time 0 following the execution of the base case
linear program. This approach does not account for delayed emergent rewards and
will miss the opportunity to pursue a higher valued mission area that is observable at the
beginning of the planning horizon but opens at some point in the future.

In order to provide an alternative to the base case heuristic we can formulate
the problem as a bandit problem from the perspective of each platform as it becomes
available. The arms in this problem are the set of open feasible mission areas and the
option to return to base or loiter. The state is defined by the platform and reward is obtained
following completion of the mission.
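The bandit view of the reassignment decision can be sketched as follows. This Python example uses a plain sample-average value estimate with ε-greedy selection, which is a stand-in chosen for brevity and not the DQ-C estimator; the class name, the arm labels, and the `feasible` argument are assumptions made for illustration.

```python
import random


class EpsilonGreedyBandit:
    """One learner per platform; arms are mission areas plus a loiter option."""

    def __init__(self, arms, epsilon, seed=0):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {a: 0 for a in self.arms}
        self.values = {a: 0.0 for a in self.arms}

    def select(self, feasible=None):
        """Choose among currently open arms: explore with probability epsilon."""
        options = [a for a in self.arms if feasible is None or a in feasible]
        if self.rng.random() < self.epsilon:
            return self.rng.choice(options)
        return max(options, key=lambda a: self.values[a])

    def update(self, arm, reward):
        """Incremental sample average of the reward received after the mission."""
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

Including "loiter" as an arm is what allows the learner to discover that preserving capacity can be worth more than the best currently open mission area.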

b. Case 1 Empirical Results 

In order to evaluate this case we set up a small benchmark scenario that 
produces the situation where we expect that the base case will miss the opportunity to 
achieve a delayed reward, see Figure 21. In order to illustrate this case, we provide a 
simple scenario with a single LRS, single GCS and single UAV assigned. Two mission 
areas are observable in the initial planning horizon. The first mission area is available from 
time 0 to time 3 and provides a value of 10 per time unit covered. The second mission 
area opens at time 3 and remains open through time 7 providing a value of 10 per time 
unit covered. The base case will schedule the coverage of the first mission using the linear
program; once the emergent mission area becomes available it will switch over to cover
the emergent target with its remaining capacity. The base case will consistently miss out
on additional value from the emergent target, no matter how many times that situation is
encountered, since it follows a purely greedy policy with no real notion of state.



Figure 21: Case 1 mission area timing (plot title: Case 1 Mission Areas Values).

In the RL formulation we provide the alternative to simply preserve capacity
in order to take advantage of a future high value mission area, and the learning problem
becomes one of identifying the appropriate time to preserve capacity versus behaving
greedily, our familiar exploration and exploitation problem. This results in an improvement
over the base case regardless of RL algorithm chosen, see Figure 22. In the course of a
scenario this uncovered case serves to insert error into the schedule, and since the scenario
developer has no means of knowing how often this occurs during the course of a scenario
run, the potential for lost value, or regret, to accumulate over the course of a large scenario
is great.

Figure 22: Case 1 results.

RL provides a mechanism to address this, but does require the use of multiple
replications to arrive at a feasible solution, while the base case requires only a single
run.


c. Case 2 Formulation 

In the second case, we replace the linear program that provides the initial
assignment with an RL formulation. The RL formulation is provided the same information
used by the base case at each planning interval. The set of available platforms is assigned
to open mission areas. The base case assigns only a single mission to each platform and
does not see past the specified time horizon. We can expect that this approach will fail to
achieve maximum value for those cases where all platforms are assigned to long-duration
mission areas at time t and new mission areas of higher value open at the next scheduled
window.

In order to provide an alternative to the base case we provide the RL formulation
with the same information provided to the base case and constrain it to decision
making on the same interval, treating the assignment of each mission package to a mission
area as an n-arm bandit problem, where the arms are defined by the mission areas and the
option to “do nothing”. State is defined by the mission package name, with the option of
using the assignment time since the mission areas are defined by open and close times. In
this case we will simply maintain the definition of state by the mission package name.

d. Case 2 Empirical Results 

We first establish a scenario which reproduces the conditions where we
would expect the base case to fail and confirm that the base case fails to obtain the
maximum value under these conditions. In order to demonstrate this we establish a single LRS,
with a single GCS and UAV assigned. The scenario consists of two mission areas, one open
from time 0-9 which provides a value of 10 per time unit covered and a second open from time
11-15 with a value of 1000 per time unit covered, see Figure 23 and Table 7. The time horizon
is set at 10, meaning that the base case will have full knowledge of mission demands out to
time 10, and will optimize the mission package and mission area pairing at time 0 and time
10. The platform only has the ability to stay on station for 9 time units.
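The size of the delayed reward in this scenario can be checked with a short calculation. The Python sketch below is a simplification: it ignores the platform transition time and assumes that a platform which spends its endurance on the first mission area cannot cover the second, as the scenario intends. The function name is an assumption made for illustration.

```python
def coverage_value(open_t, close_t, start_t, endurance, value_per_time):
    """Value earned covering one mission area, limited by window and endurance."""
    begin = max(open_t, start_t)
    end = min(close_t, begin + endurance)
    return max(0.0, end - begin) * value_per_time


# Greedy choice at time 0: cover the first area for the full endurance.
greedy_value = coverage_value(0, 9, 0, 9, 10)        # 9 time units at 10 each
# "Do nothing" at time 0, then cover the second area when it opens.
patient_value = coverage_value(11, 15, 11, 9, 1000)  # 4 time units at 1000 each
```

Under these assumptions the greedy assignment earns 90 while waiting earns 4000, which is the gap the RL formulation must learn to exploit.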

Figure 23: Case 2 mission area timing.

Table 7: Scenario configuration parameters for ASC-U Case 2.

Parameter                 Value
Optimization interval     10
Scenario length           20000
Total missions            2000
Platform transition time  1.5
Operational endurance     9
Time horizon              9


The base case consistently fails to cover the delayed reward available from 
the second open mission area since it develops an optimal schedule for the fully observable 
finite-time horizon problem with no regard to potential higher value mission areas just 
past that horizon. In this case, all resources are expended during the first planning period, 
leaving no free platforms to cover the second mission area. This case in particular occurred 
frequently enough in practice with larger scenarios to be noted by the developers. 

The RL formulation is provided the same scenario and the same information
regarding the state of the world as the base case, but is also provided the option to “do
nothing” for each platform. The reward in this case is delayed since, by construction of the
scenario, the value of the subsequent mission area is not available at the time of the
subsequent decision period. The RL formulation is able to quickly identify the value of the
delayed reward in this simple case. The base case and RL aggregate value scores using DQ-C
and ε-greedy, see Figure 24, illustrate that the incorporation of RL can overcome the issues
encountered by the base case's reliance on a fully observable environment. This emphasizes
the difference between a deterministic policy, such as a greedy algorithm or optimization,
and a stochastic policy, such as those typically used in conjunction with RL.

We compare DQ-C’s performance on this task with SARSA(λ) and Q(λ)
when paired with ε-greedy, see Figure 24.

Figure 24: Case 2 comparison mean value per iteration for each algorithm.


4. Insights on the Use of Reinforcement Learning in a Scheduling Tool 

The main insight in the application of RL within a scheduling tool is that RL does
not require a fully observable environment, as is the case in most optimization-based
schedulers, such as ASC-U. Since RL does not make use of a deterministic policy, such as a
greedy approach, RL is less likely to make a consistent sub-optimal decision as we saw in the
base case analysis of Case 2. However, it does mean that RL will require more computational
time and more iterations over the problem than the deterministic case. Depending
on the use case and size of the problem the use of RL might not be practical, but it does
provide an approach to avoid those situations where a deterministic scheduler will
consistently fail to achieve an optimal schedule. In the case of ASC-U, the case in question had
been raised to the developers and was a known problem, but not one that a scenario
developer could easily check for or avoid. This research contributes an approach to overcome
this deficiency. As ASC-U moves toward scenarios with noisy rewards and only partially
observable environments, the use of methods such as RL, which are designed to address
these issues, will be essential. Some key insights:

• RL methods are able to provide a solution that outperforms the base case linear
program in those cases where the optimal policy involves preserving capacity for
delayed rewards not fully observable at the time of planning.

• DQ-C demonstrates the ability to quickly learn in cases with delayed rewards, but
in cases where feedback is more immediate and where there is limited noise on the
reward signal, other techniques perform as well as DQ-C.

• In this setting the use of a greedy-in-the-limit approach to exploration and
exploitation is appropriate, with the final state-action values serving as the scheduling
policy.

In the next section, we will explore the application of RL algorithms to represent 
human decision making in COMBATXXI. 

B. ADAPTIVE BEHAVIOR IN COMBATXXI 

COMBATXXI is an entity-level, high-resolution, closed-form, discrete event
simulation used for analysis of future capabilities at the brigade level and below by the U.S.
Army and the U.S.M.C. analytic organizations, TRADOC Analysis Center and the
Operations Analysis Division respectively. COMBATXXI represents ground combat of light and
heavy forces, air mobile forces, future forces, aerial assets, amphibious and urban
operations. COMBATXXI additionally represents the communications network and the flow
of information on the battlefield. COMBATXXI has seen extensive use in the support of
analysis of alternatives (AoA) for both the USA and USMC (Kunde, 2005).

Agents in analytic combat simulations, such as COMBATXXI, have relatively brittle
decision-making capability, as illustrated in Figure 25, making the development of
scenarios time-consuming and making it challenging to conduct analysis of topics related
to the value of information. The typical modeling and simulation agent requires an a priori
mapping from states to actions, as represented by the arrows in Figure 25. This requires the
anticipation of each possible state and an explicit decision, regardless of the environment
dynamics, of the action the agent is to take in each case.



The incorporation of autonomous agents that make use of reinforcement learning,
see Figure 26, presents one potential technique to overcome this brittleness problem while
potentially reducing scenario development time and improving the analysis of capabilities
intended to impact the situational awareness of decision-makers. In the case of the
reinforcement learning agent, rather than script a set of rules in advance, the agent develops an
estimate of the value of its action set in each state encountered. No hard coding of rules
is required. This approach relies on the dynamics of the environment in which the agent
is operating to provide feedback on performance to update its estimate, and in that manner
can also potentially inform the verification of simulation models. If, following some
learning period, the agent’s ranking of the potential actions available from a given state does not
match what a subject matter expert or modeler believes should be the case based on their
full knowledge of the environment, then potential issues with the environment might be
uncovered that would otherwise go unnoticed.

In this section, we demonstrate the use of reinforcement learning to control the
actions of agents within two cases in COMBATXXI, both oriented toward route planning.






Figure 26: Modeling and simulation agent based on reinforcement learning. 


1. Case 1 Formulation 

In the first case, we examine a route selection problem in which the agent does not
know the length of the route in advance. The state of the agent is defined by its waypoint
location. The agent is provided a time-based reward upon arrival at the final waypoint.
Adaptive route selection is a common problem in combat simulations. In the real world a
unit will learn over time which patrol route can be traversed in the shortest amount of time
and with the least amount of threat. In this very simple case, the learning task is to learn
which of two routes provides the shortest travel time to a destination, see Figure 27.

Figure 27: Route selection scenario COMBATXXI.


2. Case 1 Empirical Results 

We see in this case that over a fairly short period of time, the agent learns to select
the shorter of the two routes. DQ-C and SARSA(λ) both minimize regret and score well in
total utility, see Figure 28. This case is a delayed reward case and it is interesting to note
that Q(λ) does not perform well in comparison to DQ-C and SARSA(λ).

Figure 28: Route selection scenario COMBATXXI.


3. Case 2 Formulation 

In the second case, we incorporate RL within a hierarchical task network designed
to select a formation type based on information from the terrain, see Figure 29. At each
waypoint, the agent makes a decision regarding the appropriate movement formation to
use to move to the next waypoint. Reward is provided upon the arrival at the next waypoint
based on the choice of movement formation. The task is to learn to use a bounding
overwatch if there are more than two buildings between the start and end points of the
movement leg. Decision-making often is informed by domain knowledge. In this case, we
leverage hierarchical task networks to enable the agent to focus learning on only the relevant
decisions. The RL controller is incorporated into a decision node within the “move in
formation” HTN. When presented with a decision situation, the agent makes a decision based
on the state of the world as defined by the number of buildings between it and the next
waypoint. This is a simple case, but both the decision and the feedback could be made
more complex in future cases.



Figure 29: Formation selection scenario COMBATXXI. 

4. Case 2 Empirical Results

In this use case, we structure the reward to reinforce the desired TTP for a given
situation. The state is defined by the number of buildings on the next leg of the circuit
and the decision-making agent's task is to learn the correct movement formation for a given
number of buildings. A reward is provided for choosing the correct movement formation
when the number of buildings is greater than 5 and a penalty of -1 is applied for choosing
the incorrect movement formation when the number of buildings is less than 5. We compare
the learning time under this formulation for DQ-C, SARSA(λ), and Q(λ) in Figure 30.
The results are consistent with previous cases, with DQ-C showing less regret per turn and
a higher overall utility than the comparison cases. Note that this case does not involve a
delayed reward and we see Q(λ) and SARSA(λ) switch places in the ranking.

Figure 30: Formation selection scenario COMBATXXI.


5. Insights on the Use of Reinforcement Learning in Combat Simulations 

The application of reinforcement learning within combat simulations in order to
represent human decision making presents unique challenges, particularly in the analysis
domain. Strict verification and validation guidelines require that algorithmic
implementations be built on a valid conceptual model. RL is uniquely suited in that regard,
having been developed in the literature on the study of animal behavior, continued and
built upon in the literature on operant conditioning, and still being contributed to today
by the field of neuroscience. RL provides a simple framework upon which to build complex
adaptive behaviors. A key insight into the application of RL in the human behavior
representation use case is that an RL agent, much like a human, will progress through a
learning period much faster with quality practice focused on the relevant decision points
defined by the context. In this regard, the agent’s perception of its state, or situation,
informs it when it is appropriate to learn and when it is appropriate to exercise a deterministic
policy known in advance. The use of RL in combat simulations lends itself to judicious use
in specific decision situations. In order for learning to occur during a run the situation must
be encountered multiple times, which implies that for practical purposes agents should
be put through training scenarios with similar situations and the state-action values stored
for initialization in record runs of the combat model. A natural effect that could be gained
from this type of approach is a mechanism to represent the effects of various levels of soldier
training and experience on performance.

The use of RL in conjunction with domain knowledge is also critical for the combat 
simulation model use case. This domain knowledge assists in the recognition of the 
situation and the need for a decision. The hierarchical task networks explored here 
provide a natural construct to facilitate this type of behavior in COMBATXXI. 

C. VERIFYING THE REWARD STRUCTURE IN TRAINING SIMULATION 

This section applies reinforcement learning to the problem of verifying that the 
reward structure in game-based training simulations supports stated learning objectives. 
Specifically, we apply RL to UrbanSim, a game designed to teach doctrinal strategy 
to tactical-level leaders operating in a counterinsurgency setting (Wansbury et al., 2010). 
Trainees make a decision at each game turn based on their perception of the state of the 
environment (see Figure 31) and receive a numeric reward signal following the completion 
of each turn, displayed in the lower right portion of the interface. To verify that the 
training system rewards the trainee for adhering to doctrinal actions, we employ an RL 
learner. The game developers identified that the verification of all potential paths through 
the training system required the use of automated mechanisms (Wang et al., 2012). 

Figure 31: UrbanSim player interface. 


1. Problem Specification 

UrbanSim requires the player to select actions that maximize a numeric reward over a 
finite time horizon. The player chooses from a raw list of actions for each agent on each 
turn, with an action space as large as 341 for some agents. The order in which actions 
are chosen matters within the context of the game. Initially, we provide a description of 
the problem that incorporates the use of domain knowledge. We gradually increase the 
difficulty of the formulation until it mirrors the unconstrained problem faced by the 
human player. We can describe the environment by defining the following indices, 

• t, turn, t = {1, ..., T}, where T is the maximum number of game turns allowed, 
constant at T = 15 turns, 

• i, action index, i = {c, h, b}, 

• j, agents, j = {0, 1, ..., 11}, 

and the following additional terms, 

• the action of agent j at turn t, 

• r_t, the score returned to the player at turn t. 

Note that the actions chosen by the reinforcement learner are strategy choices. 
Each choice results in a draw from a bin of equally likely actions associated with that 
strategy. These actions were binned by subject matter experts based on the strategy they 
were most closely associated with. The impact on the score varies among the actions within 
each bin, resulting in a noisy reward signal for each strategy choice. 
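
The source of this noise can be illustrated with a small simulation. The sketch below is not UrbanSim code: the bin contents, the score impacts, and the names `make_bins` and `sample_reward` are all hypothetical. It only shows how a uniform draw from a strategy's bin turns a fixed strategy choice into a variable reward.

```python
import random
import statistics

def make_bins(seed=0, actions_per_bin=20):
    """Build hypothetical bins: strategy -> list of per-action score impacts."""
    rng = random.Random(seed)
    # Illustrative mean impacts only; these are not UrbanSim values.
    bin_means = {"clear": 10.0, "hold": 20.0, "build": 30.0}
    return {
        strategy: [rng.gauss(mean, 5.0) for _ in range(actions_per_bin)]
        for strategy, mean in bin_means.items()
    }

def sample_reward(bins, strategy, rng):
    """Pick a strategy, then draw one of its binned actions with equal probability."""
    return rng.choice(bins[strategy])

bins = make_bins()
rng = random.Random(1)
for strategy in bins:
    draws = [sample_reward(bins, strategy, rng) for _ in range(1000)]
    # The spread within a bin is what the learner experiences as reward noise.
    print(strategy, round(statistics.mean(draws), 2), round(statistics.stdev(draws), 2))
```

The learner only observes the drawn score, so it must average over repeated choices of a strategy to estimate that strategy's value.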

2. RL Formulation and Empirical Results 

In this section, we describe three formulations explored with UrbanSim. The 
formulations differ in the state information provided to the agent and the timing of the 
reward signal. 

a. Case 1 Formulation 

Initially, we formulate the problem as a sequential sampling problem, 
allowing our agent to make a single decision on strategy at the beginning of the game and 
then receive a score following the completion of the game, with no opportunity to adjust 
strategy. This reduces the problem to a single-state problem with an action space consisting 
of all possible combinations of the three actions. In this case, the default behavior of the 
system is to allocate each strategy to the third of the game in which it is specified in the 
ordering of the characters. For example, a strategy of “clear, hold, build” will result in 
the use of a “clear” strategy by all agents during the first third of the game, a “hold” 
strategy during the second third, and a “build” strategy during the final third of the game. 
The actions associated with each strategy can also be partitioned as “lethal, non-lethal, 
mixed” and further as “clearly correct, mixed”. The 27 orderings of the three strategies, 
the three lethality partitions, and the two correctness partitions give 27 × 3 × 2 = 162 
strategy combinations, or arms, that must be explored. 
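
The arm count can be checked by enumeration. In the sketch below, the five-letter codes follow the pattern of the strategy labels reported for this case (for example MKHBC). The meaning assigned to each letter, in particular R for the mixed lethality partition, is an assumption inferred from those labels, and `enumerate_arms` and `phase_for_turn` are hypothetical helper names.

```python
from itertools import product

PHASES = "CHB"      # clear, hold, build
LETHALITY = "KNR"   # assumed: K = lethal (kinetic), N = non-lethal, R = mixed
QUALITY = "SM"      # assumed: S = clearly correct ("smart"), M = mixed

def enumerate_arms():
    """List every strategy code: quality + lethality + a phase for each third."""
    return [
        quality + lethality + "".join(order)
        for quality, lethality, order in product(
            QUALITY, LETHALITY, product(PHASES, repeat=3)
        )
    ]

def phase_for_turn(arm, turn, total_turns=15):
    """Return the phase letter in force on a 1-based game turn."""
    third = (turn - 1) * 3 // total_turns  # 0, 1, or 2
    return arm[2 + third]

arms = enumerate_arms()
print(len(arms))        # number of arms
print(len(arms) * 30)   # batch runs needed at 30 replications per arm
print([phase_for_turn("MKHBC", t) for t in (1, 6, 11)])
```

With 30 replications per arm, exhaustive batch evaluation costs 162 × 30 = 4860 games, which is the baseline a sequential sampling learner is trying to beat.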

b. Case 1 Empirical Results 

To gain insight into the true value of each arm, we first ran each of the 162 
strategy choices for 30 replications, with a unique random seed for each replication; see 
Figure 32 and Figure 33. The results clearly indicate that some strategies are preferable 
to others and that it is reasonable to formulate this as a noisy bandit problem. 

Figure 32: Mean and standard error of the final score of a 15 turn game following 30 
replications of each of the 27 strategy combinations. 

Figure 33: Mean and standard error of the final score of a 15 turn game following 30 
replications of each of the 162 strategy combinations. 

Table 8: Recommended strategies following 1000 training sessions by algorithm policy 
pair. 

Algorithm-π            Strategy 
DQ-C, Boltzmann        MKHBC 
DQ-C, ε-greedy         SNHHB 
Q(λ), Boltzmann        MRHHC 
Q(λ), ε-greedy         MKBCC 
SARSA(λ), Boltzmann    MRHHC 
SARSA(λ), ε-greedy     MKBCC 


We next formulated this as a 162-arm bandit problem to understand how long a 
learner might take to arrive at the optimal policy. We compare the results obtained using 
DQ-C, Q(λ), and SARSA(λ) paired with Boltzmann and ε-greedy strategies; see Figure 34. 
In this case we used the empirical maximum observed value as the best score in our regret 
calculations. Note that, using this approach, we could understand the learning time 
required for a learner to arrive at the optimal policy by modeling this as a bandit 
problem outside of UrbanSim, using the observed mean and variance of each strategy as the 
reward on each arm. Figure 34 shows that DQ-C paired with ε-greedy minimizes regret over 
the course of 1000 15-turn games with fixed parameters. Parameters used for each algorithm 
were set to those that produced the best results in the noisy-bandit case. 

Figure 34: Regret per turn and mean total regret for 162-arm bandit formulation of 
UrbanSim using a constant exploration rate. 
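
The regret bookkeeping for such a bandit can be sketched outside of UrbanSim as follows. This is a minimal illustration under stated assumptions: the arm means are random placeholders rather than the observed batch-run values, the learner is a generic ε-greedy agent with an exponentially weighted average estimate (not the DQ-C algorithm itself), and `run_bandit` is a hypothetical name.

```python
import random

def run_bandit(arm_means, noise_sd, games, epsilon, alpha, seed=0):
    """Play a noisy bandit with epsilon-greedy selection; return (estimates, regrets)."""
    rng = random.Random(seed)
    estimates = [0.0] * len(arm_means)
    best = max(arm_means)  # stands in for the empirical maximum observed value
    regrets = []
    for _ in range(games):
        if rng.random() < epsilon:
            arm = rng.randrange(len(arm_means))  # explore
        else:
            arm = max(range(len(estimates)), key=estimates.__getitem__)  # exploit
        score = rng.gauss(arm_means[arm], noise_sd)
        # Exponentially weighted average of observed scores (generic update).
        estimates[arm] += alpha * (score - estimates[arm])
        regrets.append(best - score)  # regret of this game against the best score
    return estimates, regrets

# Hypothetical arm means standing in for the 162 batch-run averages.
setup_rng = random.Random(42)
arm_means = [setup_rng.uniform(200.0, 350.0) for _ in range(162)]
estimates, regrets = run_bandit(
    arm_means, noise_sd=20.0, games=1000, epsilon=0.1, alpha=0.1
)
print("mean regret per game:", round(sum(regrets) / len(regrets), 1))
print("total regret:", round(sum(regrets), 1))
```

Because the reference score is a maximum of observed means, individual noisy games can produce negative regret; only the running total and per-game mean are meaningful for comparing learners.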


We also compare the performance of the algorithms using a decaying exploration 
rate; see Figure 35. The maximum score on this game is 600; the total regret and regret 
per turn for each pair are shown following the completion of 500 games. Given the size of 
the action space, we adopt a search-then-converge approach to control the level of 
exploration, with the remainder of the parameters tuned to the settings identified in our 
exploration of the noisy 10-arm bandit. 

Figure 35: Regret per turn and mean total regret for 162-arm bandit formulation of 
UrbanSim using a decaying exploration rate. 

Table 9: Recommended strategies following 500 training sessions by algorithm policy pair 
using a decaying exploration strategy. 

Algorithm-π            Strategy 
DQ-C, Boltzmann        MKHBC 
DQ-C, ε-greedy         SNBHH 
Q(λ), Boltzmann        MRBCH 
Q(λ), ε-greedy         SKCBB 
SARSA(λ), Boltzmann    MRBCH 
SARSA(λ), ε-greedy     SKCBB 
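
A search-then-converge exploration schedule of the kind referred to above can be sketched as follows. The functional form ε = ε_init / (1 + n/τ) is one common choice and is an assumption here, as are the parameter names `hold_games` and `tau` and their values; the schedule actually used in these experiments is not reproduced.

```python
def epsilon_schedule(game, eps_init=1.0, hold_games=0, tau=100.0):
    """Exploration rate for a given game number (assumed search-then-converge form).

    Exploration stays at eps_init for the first hold_games games (search),
    then decays hyperbolically with time constant tau (converge).
    """
    if game <= hold_games:
        return eps_init
    return eps_init / (1.0 + (game - hold_games) / tau)

for game in (0, 50, 100, 250, 500):
    print(game, round(epsilon_schedule(game), 3))
```

A large `tau` keeps the learner sampling broadly for longer, which matters when the number of arms is large relative to the number of games.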


We gain the following insights from this set of experiments: 

• Formulating the simplified problem as a bandit allows us to effectively use RL in a 
sequential sampling role to adaptively adjust the design of experiments. 

• When used in a sequential sampling setting to determine the optimal action choice, 
a high level of exploration is required initially to ensure broad sampling 
given a relatively small number of trials. While not a primary topic of this thesis, 
it is interesting to note that with only 500 replications we are able to identify a 
strategy with a value of 320, only 30 points below the maximum observed using batch 
runs, which consumed a total of 4860 runs. 

c. Case 2 Formulation 

We can gain insight into the optimal policy for each agent on each turn by 
incorporating this information into the state description. Here we define S = (agent, turn). 
The action set consists of twelve possible strategy choices, 

A = (mnc, mnh, mnb, mkc, mkh, mkb, snc, snh, snb, skc, skh, skb)    (41) 

As previously described, each strategy specifies a bin of actions that are 
categorized according to the three descriptors. In this case we use the following bin 
descriptions. 

• m/s, access to either the full set of actions described by the following two categories 
or only those binned as smart 

• n/k, access to either non-kinetic or kinetic bins 

• c/h/b, access to either the clear, hold, or build bin of actions 

The reward function provides a numeric reward signal to the agent at the 
completion of each turn as in the case of a human player. 
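
The state and action definitions above map directly onto a tabular value store. The sketch below builds the twelve strategy codes of Equation 41 and a table keyed by (agent, turn). The update shown is a generic running-average step used only for illustration, not the DQ-C update, and `new_value_table` and `update` are hypothetical names.

```python
from itertools import product

# The twelve strategy codes of Eq. (41): m/s x n/k x c/h/b.
ACTIONS = ["".join(code) for code in product("ms", "nk", "chb")]
AGENTS = [
    "Battalion Commander", "CA UNIT", "E CO A", "E CO B", "F CO A", "F CO B",
    "G CO A", "G CO B", "H CO A", "H CO B", "QRF",
]
TURNS = range(1, 16)  # 15-turn game

def new_value_table(initial=0.0):
    """One value per (state, action), where a state is an (agent, turn) pair."""
    return {
        (agent, turn): {action: initial for action in ACTIONS}
        for agent in AGENTS
        for turn in TURNS
    }

def update(table, agent, turn, action, reward, alpha=0.1):
    """Move the stored value toward the turn reward (generic running average)."""
    values = table[(agent, turn)]
    values[action] += alpha * (reward - values[action])

table = new_value_table()
update(table, "QRF", 3, "skh", reward=240.0)
print(len(ACTIONS), len(table), len(ACTIONS) * len(table))
```

The table holds 11 × 15 = 165 states and 12 actions per state, which is small enough to learn from on the order of a thousand games.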

d. Case 2 Empirical Results 

In this case we allow the agent to select actions by agent by turn, with a 
numeric reward following each turn and an initially high level of exploration that begins 
to decay after 500 games. Following the completion of 1000 games, DQ-C provided the 
lowest end-of-game regret; see Figure 36. 

Figure 36: Regret per 15 turn game and total regret over 1000 games with decisions made 
by agent by turn. 

DQ-C paired with ε-greedy provided the best empirical performance in this 
case, followed by DQ-C paired with Boltzmann. The size of the state space and frequent 
exploration likely contributed to Q(λ)’s relatively poor performance. SARSA(λ) paired 
with a Boltzmann strategy rounded out the top three; see Figure 37. 

Figure 37: Mean score over 1000 games with decisions made by agent by turn for each 
algorithm policy pair. 
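
The two exploration policies compared here differ in how they spread probability over actions. The sketch below gives the standard textbook forms of each; the function names and the example values are hypothetical and the temperature and ε settings are not those used in the experiments.

```python
import math

def boltzmann_probs(values, temperature):
    """Softmax over action values; higher temperature means more exploration."""
    top = max(values)  # subtract the maximum for numerical stability
    weights = [math.exp((v - top) / temperature) for v in values]
    total = sum(weights)
    return [w / total for w in weights]

def epsilon_greedy_probs(values, epsilon):
    """Explore uniformly with probability epsilon, otherwise take the best action."""
    n = len(values)
    best = max(range(n), key=values.__getitem__)
    probs = [epsilon / n] * n
    probs[best] += 1.0 - epsilon
    return probs

values = [230.0, 240.0, 260.0]  # illustrative action values
print([round(p, 3) for p in boltzmann_probs(values, temperature=10.0)])
print([round(p, 3) for p in epsilon_greedy_probs(values, epsilon=0.1)])
```

Boltzmann selection favors near-best actions over clearly poor ones, while ε-greedy treats all non-greedy actions alike, which is one reason the two can rank differently as the state-action space grows.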


Using information collected during the run, we examine the value of each 
(s, a) pair, as shown for the battalion commander in Table 10. 

Table 10: Learned policy for Battalion Commander by turn for 15 turn UrbanSim game 
using DQ-C. 

Strategy  1         2         3         4         5         6         7         8 
SKC       9.4E-316  4.8E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316 
SKH       2.3E+02   2.4E+02   2.4E+02   2.4E+02   2.4E+02   2.4E+02   2.4E+02   2.5E+02 
SKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316 
SNC       2.4E+02   2.4E+02   2.3E+02   2.3E+02   2.4E+02   2.4E+02   2.4E+02   2.4E+02 
SNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316 
SNB       2.3E+02   2.6E+02   2.3E+02   2.4E+02   2.3E+02   2.4E+02   2.4E+02   2.5E+02 
MKC       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316 
MKH       2.6E+02   2.3E+02   2.6E+02   2.4E+02   2.4E+02   2.4E+02   2.4E+02   2.3E+02 
MKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316 
MNC       2.4E+02   2.4E+02   2.4E+02   2.6E+02   2.3E+02   2.4E+02   2.4E+02   2.4E+02 
MNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316 
MNB       0.0E+00   2.4E+02   2.3E+02   2.3E+02   2.4E+02   2.4E+02   2.4E+02   2.4E+02 

Strategy  9         10        11        12        13        14        15 
SKC       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316 
SKH       2.4E+02   2.5E+02   2.5E+02   2.6E+02   2.5E+02   2.6E+02   2.6E+02 
SKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316 
SNC       2.5E+02   2.5E+02   2.6E+02   2.6E+02   2.6E+02   2.6E+02   3.0E+02 
SNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316 
SNB       2.4E+02   2.5E+02   2.6E+02   2.9E+02   2.5E+02   2.6E+02   2.6E+02 
MKC       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316 
MKH       2.5E+02   2.4E+02   2.9E+02   2.6E+02   2.6E+02   2.6E+02   2.6E+02 
MKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316 
MNC       2.5E+02   2.8E+02   2.5E+02   2.6E+02   2.6E+02   2.6E+02   2.6E+02 
MNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316 
MNB       2.5E+02   2.5E+02   2.5E+02   2.5E+02   2.6E+02   0.0E+00   2.6E+02 


We can use this information to develop a policy matrix for each algorithm 
and exploration strategy employed (see Appendix G and Appendix H) that reflects the 
application of a greedy strategy following 1000 15-turn games. This policy matrix provides 
the recommended strategy for each agent for each game turn; see Tables 11 and 12. Note 
that each algorithm exploration policy pair produces a slightly different matrix, reflective 
of the difference in the mean scores of each algorithm. 

Table 11: Learned policy for all agents by turn for 15 turn UrbanSim game using DQ-C, 
ε-greedy. 

Agent                1   2   3   4   5   6   7   8   9   10  11  12  13  14  15 
Battalion Commander  MKH SNB MKH MNC SKC MNH MNH SNH SKB MNC MKH SNB SNH MNH SNC 
CA UNIT              SNC MKH MKH MNH MKB MKB SNC SNH MNC SKH SNH MNH SKC MKH SNH 
E CO A               SKH SKC SKH MKC MNB SKB MNC MNH MNB MKC MNC MKH MNH MKB SKH 
E CO B               SNB MKB SNC MKC SNH SNB MKB SKB SNC SNC SNB SKH MNH MKB MNB 
F CO A               SNB SKC SNH SKC SKB SKB SNC SNB MNH MNH MNC SNC MNC SKH MNC 
F CO B               SNH SKH SKC MKB SNC SKC MKB SNC SNC MNC MKB SNC SNH MKC MKB 
G CO A               SNB SKB MNC MKB SNH SKB MNH SNH MKC SNH SKB SKC SKC MNC SNB 
G CO B               SNB MNC SNB MNH SNH MKC SKC MNH SKH MKC SNB SKC SKC SNH SNH 
H CO A               SKC MNH MNH SNB SKH SKB SKH SNB SKB MKH SKC SNB SKC SKH SNC 
H CO B               SNH MNB SNB MKB SKH MKB SKC MNC SNB SKC SNB MNB SKH SKH SNH 
QRF                  SNH SKH MNH MKC MNB MKH MNB SKH SKC SNH SKH MNH MKH MNH SKB 


Table 12: Learned policy for all agents by turn for 15 turn UrbanSim game using DQ-C, 
Boltzmann. 

Agent                1   2   3   4   5   6   7   8   9   10  11  12  13  14  15 
Battalion Commander  MKH MNH MNH SKC MKH MKB MKB MNB SNH SKB MNB SKC MKB SKC SNC 
CA UNIT              MKB SNC MNB SKC MKB MKH SNB MNH SNH SKC MNH SNC SKC SNH MNH 
E CO A               SNH SKH MKH SNC SKB MKC MNC SKB MKC MNH MKB MNH SNC SNC MKB 
E CO B               MNB SNH MKC SNC MKH SKC SNB MNH SNH SNB SKC MNH SKB MNC SNC 
F CO A               MNH MNC SKC MKC SNB SKB SKC MKH SNB SNB SNB SNH SNC MKC MNB 
F CO B               MNH MKB SNB SNC MKH SNH MKH MKH SKH MKH SNH SNC MKB MKC MKH 
G CO A               SNH MKB SNC MNH SNH MNC SKH MNB SKC MNB MKC MNC SNH SKB SNC 
G CO B               MNH SNC MKB SNC MKB MKB SKB SNH MKC SNH SNH MKC MKB SKB MKB 
H CO A               MKH SNC SKC MKC MNB MKC SKC MNB MNH MNH SKB SKB MNH SKB MKB 
H CO B               SNB SKC MKB SKB SKH SKC MNB MNC MNC SKB SNC SKB SNB MNC SKH 
QRF                  SNB SKB SNB SKB SNB MNC SNB MNC SKH SKB MNB SKH MKH SNC SKB 


Using these policy tables (see Appendix H), a training developer could 
assess how well the reward signal from the training game supports the stated learning 
objectives. 
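
The greedy read-out that turns learned state-action values into a per-turn policy can be sketched as follows. The values are the Battalion Commander entries for turns 1 through 4 as printed in Table 10, with the entries on the order of 1E-316 treated as zero; the function name `greedy_policy` is hypothetical.

```python
# Row order of Table 10.
STRATEGIES = ["SKC", "SKH", "SKB", "SNC", "SNH", "SNB",
              "MKC", "MKH", "MKB", "MNC", "MNH", "MNB"]

# Battalion Commander values for turns 1-4, transcribed from Table 10
# (entries printed as roughly 1E-316 are treated as 0.0).
VALUES_BY_TURN = {
    1: [0, 230, 0, 240, 0, 230, 0, 260, 0, 240, 0, 0],
    2: [0, 240, 0, 240, 0, 260, 0, 230, 0, 240, 0, 240],
    3: [0, 240, 0, 230, 0, 230, 0, 260, 0, 240, 0, 230],
    4: [0, 240, 0, 230, 0, 240, 0, 240, 0, 260, 0, 230],
}

def greedy_policy(values_by_turn, strategies):
    """For each turn, recommend the strategy with the highest learned value."""
    policy = {}
    for turn, values in sorted(values_by_turn.items()):
        best_index = max(range(len(values)), key=values.__getitem__)
        policy[turn] = strategies[best_index]
    return policy

print(greedy_policy(VALUES_BY_TURN, STRATEGIES))
# -> {1: 'MKH', 2: 'SNB', 3: 'MKH', 4: 'MNC'}
```

These four recommendations agree with the first four Battalion Commander entries of Table 11. Where the printed values tie at two significant figures, as they do on several later turns, the read-out depends on precision that the table does not show.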

e. Case 3 Formulation 

We adjust the formulation in this case, defining the state by agent and turn as 
for case 2, but with the action space expanded to the full problem the human player faces. 
The learning problem consists of learning how to choose actions for 11 agents across 15 
game turns. The action space for each agent is expanded from the 12 strategy choices 
used in the previous case to an action space in excess of 300 for some agents. The 
action space for each of the agents is shown in Table 13. 


Table 13: Actions available by agent each turn 

Agent                Actions per turn 
Battalion Commander  42 
CA UNIT              112 
E CO A               341 
E CO B               341 
F CO A               341 
F CO B               341 
G CO A               341 
G CO B               341 
H CO A               341 
H CO B               341 
QRF                  152 


The reward function is the same as defined previously for case 2. For this 
case we use the algorithm policy pair from case 2 that provided the best performance, DQ-C 
paired with ε-greedy. 

f. Case 3 Empirical Results 

The RL system was configured to explore for the first 1000 games and then 
to begin reducing the ratio of exploration to exploitation for the next 300 games; see 
Figure 38. Initially, ε was set to 1.0 due to the large state-action space and was reduced 
following the completion of game 1000 like, e = ,, £initlu 1 l ,^ r , for the remainder of the 
period with a final e = jkq- We produce a policy by agent by turn (see Table 14) for DQ-C 
paired with ε-greedy only, due to run-time constraints. DQ-C paired with ε-greedy was 
chosen because of its previously discussed strong performance. 



Figure 38: End of game score for 1000 learning games and 300 greedy games. 


It is interesting to note that the final score from the greedy policy is higher 
than any previously observed during the learning period. 

Table 14: Learned action policy for all agents by turn for 15 turn UrbanSim game using 
DQ-C, ε-greedy. Entries are action indices. 

Agent                1   2   3   4   5   6   7   8   9   10  11  12  13  14  15 
Battalion Commander  13  5   12  8   31  36  41  3   40  29  35  25  8   33  15 
CA UNIT              6   88  14  5   88  51  33  69  77  38  18  42  30  38  63 
E CO A               178 312 131 32  152 261 57  71  234 299 236 235 318 266 230 
E CO B               282 236 324 314 249 193 125 114 141 151 195 104 28  114 333 
F CO A               322 212 16  325 320 258 60  133 114 162 200 245 253 317 290 
F CO B               262 170 337 102 291 331 154 169 312 2   166 149 1   142 55 
G CO A               321 233 254 169 77  46  35  46  187 31  233 93  28  73  57 
G CO B               48  166 158 327 166 301 26  305 160 292 108 111 179 98  317 
H CO A               252 294 139 290 180 27  177 286 297 109 338 276 42  224 236 
H CO B               337 20  322 289 319 11  181 289 49  78  338 224 285 332 152 
QRF                  49  56  69  37  86  92  98  108 53  25  89  46  34  93  13 


Using Table 14, scenario or game developers can examine the paths to ensure 
that the desired behaviors are being reinforced by the training system. We can then turn 
these index tables into the actual action descriptions presented to a human player; see 
Tables 15, 16, 17, 18, and 19. 

Table 15: Learned actions for all agents from turn 1-3 in 15 turn UrbanSim game using 
DQ-C, ε-greedy. 

Agent | Turn 1 | Turn 2 | Turn 3 
Battalion Commander | Give Propaganda-Kasiriyah Quarter | Give Propaganda-Market District | Give Propaganda-Kasiriyah South Quarter 
CA UNIT | Give Propaganda-Nahiyat Musalla | Repair-Kasiriyah Quarter School | Give Propaganda-Hisar Bakr 
E CO A | Remove-Highway 5 | Recruit Soldiers-Nahiyat Abu Qasim | Repair-Shipping Terminal Cement Plant 
E CO B | Cordon and Knock-Northern Area | Patrol Neighborhood-Hisar Bakr | Recruit Soldiers-Kasiriyah South Quarter 
F CO A | Recruit Soldiers-Nahiyat Artet | Attack Group-Shiite Death Squads | Arrest Person-Rushdi Kaliq 
F CO B | Seize Structure-Granary 2 (IED Manufacturing Plant) | Remove-Minaret | Recruit Police-Nahiyat Artet 
G CO A | Recruit Soldiers-Kassad Quarter | Patrol Neighborhood-Nahiyat Artet | Seize Structure-Mekel Bus Station 
G CO B | Cordon and Search-Kasiriyah South Quarter | Remove-Adnan Mosque | Repair-Police Western HQ 
H CO A | Seize Structure-Firing Range | Cordon and Knock-Kasiriyah Quarter | Repair-Firing Range 
H CO B | Recruit Police-Nahiyat Artet | Arrest Person-Police Colonel Qasim Bashir | Recruit Soldiers-Nahiyat Artet 
QRF | Remove-Nahiyat Artet Mosque | Remove-Kassad Water Storage Facility | Remove-Hwy 2 Gas Station 

Table 16: Learned actions for all agents from turn 4-6 in 15 turn UrbanSim game using 
DQ-C, ε-greedy. 

Agent | Turn 4 | Turn 5 | Turn 6 
Battalion Commander | Give Propaganda-Hisar Kerhad | Host Meeting-DEP Mayor Dabir Ta’anari | Host Meeting-Mayor Anwar Sadiq 
CA UNIT | Give Propaganda-Market District | Repair-Kasiriyah Quarter School | Host Meeting-Police Colonel Qasim Bashir 
E CO A | Treat Wounds/Illnesses-Tribe 7 | Repair-Hasim Road Textile Mill | Seize Structure-Omar Hasim’s Office 
E CO B | Recruit Soldiers-Malmoud Quarter | Seize Structure-Highway 5 | Remove-Highway 2 Bridge 
F CO A | Recruit Soldiers-Hisar Bakr | Recruit Soldiers-Hisar Sifiri | Seize Structure-City Municipal Complex 
F CO B | Host Meeting-Abim Ubaya | Cordon and Knock-Kassad Quarter | Recruit Police-Market District 
G CO A | Remove-Water Treatment Plant | Pay-Kurdish Shepherds | Cordon and Search-Kassad Quarter 
G CO B | Recruit Police-Nahiyat Abu Qasim | Remove-Adnan Mosque | Set up Checkpoint-Market District 
H CO A | Cordon and Knock-Hisar Sifiri | Remove-Yaminah Market | Arrest Person-Omar Hasim 
H CO B | Cordon and Knock-Hisar Kerhad | Recruit Soldiers-Hisar Kerhad | Give Propaganda-Nahiyat Artet 
QRF | Remove-Mekel Market | Patrol Neighborhood-Nahiyat Ayadh | Patrol Neighborhood-Hisar Sifiri 

Table 17: Learned actions for all agents from turn 7-9 in 15 turn UrbanSim game using 
DQ-C, ε-greedy. 

Agent | Turn 7 | Turn 8 | Turn 9 
Battalion Commander | Give Propaganda-Nahiyat Ayadh | Advise-Police Colonel Qasim Bashir | Host Meeting-Rushdi Kaliq 
CA UNIT | Support Politically-Terlan Demikan | Repair-Shipping Terminal Cement Plant | Joint Investigate-Mekel Quarter 
E CO A | Dispatch Individual-Ebrahim Hafiz | Patrol Neighborhood-Kasiriyah South Quarter | Set up Checkpoint-Nahiyat Ayadh 
E CO B | Support Politically-Rushdi Kaliq | Repair-Mekel Bus Station | Repair-Highway 2 Bridge 
F CO A | Repair-Trash Depot | Support Politically-Rushdi Kaliq | Repair-Highway 2 
F CO B | Remove-Water Treatment Plant | Recruit Soldiers-Nahiyat Abu Qasim | Give Propaganda-Malmoud Quarter 
G CO A | Cordon and Search-Kassad Quarter | Remove-City Municipal Complex | Treat Wounds/Illnesses-Tribe 1 
G CO B | Set up Checkpoint-Hisar Sifiri | Repair-Iraqi Army BDE HQ | Cordon and Knock-Nahiyat Artet 
H CO A | Cordon and Knock-Market District | Set up Checkpoint-Nahiyat Abu Qasim | Host Meeting-Mayor Anwar Sadiq 
H CO B | Cordon and Knock-Hisar Kerhad | Cordon and Search-Kasiriyah Quarter | Pay-Tribe 3 
QRF | Seize Structure-Al-Hamra’ Central School | Remove-Highway 2 Bridge | Dispatch Individual-Omar Hasim 





Table 18: Learned actions for all agents from turn 10-12 in 15 turn UrbanSim game using 
DQ-C, ε-greedy. 

Agent | Turn 10 | Turn 11 | Turn 12 
Battalion Commander | Host Meeting-Rushdi Kaliq | Host Meeting-Ebrahim Hafiz | Give Gift-Ali Bakr Obeidi 
CA UNIT | Joint Investigate-Mekel Quarter | Treat Wounds/Illnesses-Tribe 1 | Joint Investigate-Nahiyat Artet 
E CO A | Set up Checkpoint-Nahiyat Ayadh | Patrol Neighborhood-Hisar Bakr | Patrol Neighborhood-Kasiriyah Quarter 
E CO B | Repair-Highway 2 Bridge | Remove-Al-Qassas Brigade Safehouse | Host Meeting-Jafar Udnan 
F CO A | Repair-Highway 2 | Remove-JAAS Safehouse | Seize Structure-Al-Hamra’ City Hospital 
F CO B | Give Propaganda-Malmoud Quarter | Remove-Adnan Mosque | Repair-Granary 2 (IED Manufacturing Plant) 
G CO A | Treat Wounds/Illnesses-Tribe 1 | Patrol Neighborhood-Nahiyat Artet | Joint Investigate-Hisar Kerhad 
G CO B | Cordon and Knock-Nahiyat Artet | Host Meeting-Asad | Host Meeting-Terlan Demikan 
H CO A | Host Meeting-Mayor Anwar Sadiq | Recruit Police-Kasiriyah South Quarter | Seize Structure-Sewage Processing Facility 
H CO B | Pay-Tribe 3 | Recruit Police-Kasiriyah South Quarter | Patrol Neighborhood-Malmoud Quarter 
QRF | Dispatch Individual-Omar Hasim | Patrol Neighborhood-Nahiyat Musalla | Remove-Bakr District School 

Table 19: Learned actions for all agents from turn 13-15 in 15 turn UrbanSim game using 
DQ-C, ε-greedy. 

Agent | Turn 13 | Turn 14 | Turn 15 
Battalion Commander | Give Propaganda-Hisar Kerhad | Host Meeting-Police Colonel Qasim Bashir | Give Gift-Fouad Sulimani 
CA UNIT | Pay-Tribe 4 | Joint Investigate-Mekel Quarter | Support Politically-Jafar Udnan 
E CO A | Recruit Soldiers-Mekel Quarter | Seize Structure-Al-Qassas Brigade Safehouse | Patrol Neighborhood-Hisar Kerhad 
E CO B | Treat Wounds/Illnesses-Kurdish Shepherds | Support Politically-Rushdi Kaliq | Recruit Police-Nahiyat Musalla 
F CO A | Seize Structure-Transformer Station | Recruit Soldiers-Nahiyat Musalla | Cordon and Knock-Hisar Sifiri 
F CO B | Give Propaganda-Nahiyat Abu Qasim | Repair-Kasiriyah Quarter School | Release Person-Jafar Udnan 
G CO A | Treat Wounds/Illnesses-Kurdish Shepherds | Dispatch Individual-Mayor Anwar Sadiq | Release Person-Police CPT Al-Nasser 
G CO B | Remove-Artet District School | Joint Investigate-Kasiriyah Quarter | Recruit Soldiers-Nahiyat Musalla 
H CO A | Cordon and Search-Nahiyat Musalla | Patrol Neighborhood-Malmoud Quarter | Patrol Neighborhood-Hisar Bakr 
H CO B | Cordon and Knock-Hisar Adnan | Recruit Police-Nahiyat Musalla | Repair-Hasim Road Textile Mill 
QRF | Remove-Al-Hamra’ City Hospital | Patrol Neighborhood-Kassad Quarter | Dispatch Individual-Fouad Sulimani 

3. Insights on the Verification of Training Simulations 

One of the chief insights was the notion of treating the training simulation as a 
bandit problem. Viewed in this manner, the challenge from the design standpoint 
is to specify the reward on each arm so as to correctly reinforce the behaviors associated 
with the learning objectives. Since the learning objective is at the strategy level, the mean 
value of the action bin associated with each strategy specifies the reward on each of the 
strategy arms. This is equivalent to a noisy bandit problem. Using this paradigm, the level 
of difficulty of the training simulation could be directly controlled by adjusting the noise 
on the reward signal. Following an easy-to-hard progression, the level of noise associated 
with the strategy arms could start out at 0 and be incrementally adjusted based on the 
trainee’s progression, with a goal of trainees maintaining the optimal policy even in a 
noisy environment in the most difficult cases. The development of optimal policies for each 
agent could also support the use of semi-automated forces to allow trainees to focus on a 
single role in the game. This application area is open to much future research that builds 
upon the results documented here and in the master’s thesis of U.S. Army Major Brian Voght, 
whose research was supported by this effort. 
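
The easy-to-hard progression described above could be driven by a simple rule such as the one sketched below. This is an illustration only: the threshold, step size, noise cap, and the name `next_noise_level` are all hypothetical and do not come from UrbanSim or from the experiments reported here.

```python
def next_noise_level(noise_sd, recent_choices, optimal_strategy,
                     step=1.0, max_noise=10.0, threshold=0.8):
    """Raise the reward noise once the trainee reliably picks the optimal strategy.

    recent_choices holds the strategies chosen in the latest session. If the
    fraction matching optimal_strategy reaches the threshold, the noise on the
    strategy arms increases by one step, up to max_noise.
    """
    if not recent_choices:
        return noise_sd
    hits = sum(1 for choice in recent_choices if choice == optimal_strategy)
    if hits / len(recent_choices) >= threshold:
        return min(noise_sd + step, max_noise)
    return noise_sd

noise = 0.0  # start with a noise-free reward signal
sessions = [
    ["hold"] * 10,                 # trainee follows the optimal strategy
    ["hold"] * 8 + ["clear"] * 2,  # still at the threshold
    ["clear"] * 6 + ["hold"] * 4,  # performance drops, so noise is held
]
for session in sessions:
    noise = next_noise_level(noise, session, "hold")
    print(noise)
```

The rule only ever holds or raises the noise; a fuller design might also lower it when a trainee struggles for several sessions.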

D. CONCLUSIONS 

In this section, we demonstrated novel applications of RL within three different 
use cases relevant to military modeling and simulation: the assignment and scheduling of 
unmanned systems, the representation of human decision making in a combat simulation, 
and the verification of the reward structure in a training simulation. 

In each case, the use of RL was unique and served to address known deficiencies or 
needs within the target simulation. In ASC-U, we observed that RL could overcome 
specific cases in which the finite horizon optimization approach failed. In COMBATXXI, we 
observed that the incorporation of RL in specific decision points where domain knowledge 
is available could provide adaptive behaviors more representative of human behavior and 
potentially represent the various training levels present in a real population. In UrbanSim, 
RL demonstrated its utility as an approach to address the challenge of verifying that the 
reward signal provided to a trainee was reinforcing the stated learning objectives. In each 
case, the performance of DQ-C was competitive with the comparison algorithms. 

• The definition of state and the size of the state space greatly impact the learning rate 
of RL techniques. 

• The use of domain knowledge, as a human would, can greatly improve the 
performance of RL methods. 

• The identification of the appropriate decisions that potentially require RL, such as 
decision nodes on a decision tree, can guide the effective use of RL. 

• In application within combat simulation, the use of a training period to allow 
behaviors to form is likely required prior to the conduct of record analysis runs. 

• The use of RL as a means of verifying the reward structure within training 
simulations is a natural fit for those cases where the verification does not lend itself to 
standard design of experiments techniques. 

In the next chapter we develop a practical cognitive architecture for use in discrete 
event simulation that places RL into the broader context of human behavior models. 

V. DEVELOPMENT OF A PRACTICAL COGNITIVE ARCHITECTURE 


We appeal to ordinary perception to arrive at our physical theories, yet those 
same theories seem to undermine that everyday perception, which is rich in 
meaning. - Bertrand Russell, Analysis of Matter (1927) 

The purpose of this section is to review the literature and current state of the art 
regarding cognitive social simulations and cognitive architectures, and their application 
in support of military decision making and analysis. Cognitive social simulation, a 
derivative of agent-based social simulation, combines the use of cognitive architectures 
with traditional agent-based social simulation approaches. 

This section develops a practical cognitive architecture that places reinforcement 
learning within the broader context of human behavior models. We develop a conceptual 
model of the cognitive architecture and describe its implementation in a discrete event 
simulation framework. We also describe the application of reinforcement learning and the 
cognitive architecture within an agent based social simulation. Results reported in this 
section were previously discussed in several conference proceedings, an accepted in-press 
journal article (S. Papadopoulos et al., 2013), and student thesis work supported by this 
effort (J. K. Alt et al., 2011; Ozcan et al., 2011; Ozkan, 2011; M. Papadopoulos, 2010; 
Pollock et al., 2011; McKaughan, 2011). 

A. GENERAL FRAMEWORK AND IMPLEMENTATION DESCRIPTION 

The cognitive architecture proposed in this research provides a minimalist approach 
for modeling human decision making based on the concept of situation. While multiple 
cognitive architectures exist in the literature, the framework proposed here seeks to 
incorporate the impact of relevant concepts from cognitive science, psychology, and social 
psychology in a relatively simple manner. The intent is to avoid a kitchen sink approach 
by identifying a framework to account for the influence of these notions using the smallest 
number of concepts and parameters possible. The prototype architecture provides a 
framework for experimentation with software agents for use in agent based social simulations 
with potential for the use of the architecture in conjunction with empirical data collection 
efforts. The need for an agent decision making architecture centered on the recognition of 
a given situation is highlighted by the literature on decision making and the need to reduce 
complex state spaces in agent environments (Klein, 1993; Russell & Norvig, 2010). 

Agent architectures capable of recognizing relevant situations enable the use of 
algorithms such as RL (Sutton & Barto, 1998). RL provides multiple techniques to enable 
software agents to select actions in given situations based on a reward policy specified by 
the modeler. The use of utility based rewards allows these policies to be tailored to the 
desired use case and role (J. K. Alt et al., 2011). 



Figure 39: Practical cognitive architecture full conceptual model. 

Applications such as battlefield command and control systems and agent based 
social simulations require agents capable of allocating selective attention to relevant percepts 
in a given context. This combination of bottom up and top down processing in 
conjunction with the constraints of working memory facilitates the construction of a situation. The 
framework allows for the representation of human behavioral phenomena such as change 
blindness, where changes in a scene are not observed due to the effects of top down 
processing on selective attention. This framework also provides a mechanism for agents to 
participate in collective learning within a social network, to determine which agents to 
communicate with and to determine what messages should be attended to. 

This section describes the proposed cognitive architecture from a conceptual 
standpoint. 

1. Perception 

The main function of the perception module is to form a situation constrained by 
the limits of working memory and informed by selective attention. Percepts arrive at the 
perception module via sensors that sense information from the environment and from the 
internal agent feedback mechanisms. Note that this architecture treats inter-agent 
communications through the receipt of information via percepts and the decision to communicate 
via action selection. 

Percepts are screened for relevance based on selective attention and, if found 
relevant to the current situation, are processed into working memory. Selective attention is 
driven by top down processing from the task and context (Anderson, 2005; Anderson & 
Schunn, 2005). Selective attention is influenced by the current motivations and emotions 
from the meta-cognition module, which serves to identify the goal. In communications, 
selective attention is informed by information regarding the relationship with the other 
agent and notions such as trust. Working memory is limited to 7-10 percepts, the 
generally accepted limit (Wickens & Hollands, 2000). The final set of percepts is considered a 
description of the current situation, considering both external perceptual information and 
information from the agent's internal state. 
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2. Meta-cognition 

The meta-cognitive module provides the agent's top-down direction based on 
motivations and emotions elicited by a given situation input from the perception module. 
Meta-cognition is broadly defined as any cognitive process that monitors or controls other aspects 
of cognition, or thinking about thinking (Davis & Venkatamuni, 2010). Meta-cognition is 
described by Flavell as occurring in three phases: 1) meta-cognitive knowledge stores 
information regarding the environment, task, and known strategies; 2) meta-cognitive 
experience stores information describing previous means of achieving a given result; 3) 
meta-cognitive regulation describes the process of monitoring and controlling progress on 
cognitive tasks (Flavell, 1979). 

The meta-cognitive module also hosts the agent's information regarding the 
motivation of the agent's behavior. The input to the meta-cognition module is the most recent situation 
provided by the perception module. Using this updated situation, the meta-cognition 
module conducts an update to determine which motivations are active and to assess the impact 
of the new situation on its goal state. The situation object in conjunction with information 
from long-term memory is used to form expectations about likely future situations (J. K. Alt 
et al., 2011). Goals and methods are selected using input on motivation, emotional state, 
and expectations in conjunction with long-term memory. As a result of this step, selective 
attention is updated based on the new goal, and the updated situation and goal are passed to 
the action selection module. 

3. Long-term Memory 

Long-term memory stores information learned over time for future retrieval based 
on the situation. Reward histories from prior action selections as well as long-term beliefs 
and issue stances are maintained in long-term memory. Relevant actions for given 
situations as well as mappings of equivalent situations can be returned from long-term memory 
based on need in a given situation. 
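
The role of long-term memory described above can be illustrated with a minimal Python sketch. The class name, method names, and the step size are assumptions made for illustration and do not come from the implemented architecture. The value update is an exponentially weighted average of observed rewards, in the spirit of the estimator discussed in this research, but the sketch is not a reproduction of any specific algorithm from earlier chapters.

```python
from collections import defaultdict


class LongTermMemory:
    """Hypothetical store of reward histories keyed by (situation, action)."""

    def __init__(self, step_size=0.1):
        self.step_size = step_size          # assumed weighting of recent rewards
        self.values = defaultdict(float)    # (situation, action) -> reward estimate
        self.trials = defaultdict(int)      # (situation, action) -> number of trials
        self.equivalent = {}                # raw situation -> canonical situation

    def canonical(self, situation):
        """Map a situation onto its stored equivalent, if one is known."""
        return self.equivalent.get(situation, situation)

    def record(self, situation, action, reward):
        """Update the exponentially weighted average reward for an action."""
        key = (self.canonical(situation), action)
        self.trials[key] += 1
        self.values[key] += self.step_size * (reward - self.values[key])

    def recall(self, situation, action):
        """Return the (estimate, trial count) pair for an action in a situation."""
        key = (self.canonical(situation), action)
        return self.values.get(key, 0.0), self.trials.get(key, 0)


memory = LongTermMemory(step_size=0.5)
memory.equivalent["market_at_dusk"] = "market"   # two situations treated as one
memory.record("market", "trade", 1.0)
memory.record("market_at_dusk", "trade", 1.0)
estimate, trial_count = memory.recall("market", "trade")
```

Because both raw situations map onto the same canonical situation, experience gathered in one is available in the other, which is how the mapping of equivalent situations reduces the effective state space.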

4. Action Selection 


Situation based action selection reduces the state space of the model by 
categorizing equivalent states as a single unique situation. For 
each unique situation there exists a set of relevant candidate actions. Actions have an 
associated activation level provided by a utility based RL algorithm (Sutton & Barto, 1998). 
If the agent has enough experience, defined as a specified number of trials of each action, 
then the agent's action selection is controlled by a softmax function, such as the Boltzmann 
distribution, with a greedy setting, replicating recognition-primed decision making (Klein, 
1993). If the agent has some level of experience in the situation then action selection can 
still be conducted using the softmax function, but with an exploratory setting. If the agent 
has no experience in the situation, then mental simulation is conducted, with the agent 
using available knowledge regarding the environment to project future states based on the 
actions currently available (Klein, 1993; Kunde, 2005). An alternative for the case where 
sufficient experience is not present is to base the decision making mode on the risk level 
associated with the given situation. In this formulation, if the requisite experience to use 
recognition-primed decision making is not present, the agent simply uses 
the softmax function with an exploratory setting when risk is low, and uses mental 
simulation when risk is high. 
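
The difference between the greedy and exploratory settings of the softmax function can be made concrete with a short Python sketch of Boltzmann selection. The function names, action names, activation values, and temperature values below are hypothetical and chosen only for illustration. A low temperature concentrates probability on the highest valued action, while a high temperature spreads it across the candidates.

```python
import math
import random


def boltzmann_probabilities(action_values, temperature):
    """Return Boltzmann (softmax) selection probabilities for each action."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtract the maximum value before exponentiating for numerical stability.
    max_value = max(action_values.values())
    weights = {
        action: math.exp((value - max_value) / temperature)
        for action, value in action_values.items()
    }
    total = sum(weights.values())
    return {action: weight / total for action, weight in weights.items()}


def select_action(action_values, temperature, rng=random):
    """Sample one action according to its Boltzmann probability."""
    probabilities = boltzmann_probabilities(action_values, temperature)
    actions = list(probabilities)
    weights = [probabilities[action] for action in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


# Hypothetical activation levels for three candidate actions.
values = {"patrol": 1.0, "negotiate": 2.0, "withdraw": 0.5}
greedy = boltzmann_probabilities(values, temperature=0.05)      # near-greedy
exploratory = boltzmann_probabilities(values, temperature=5.0)  # near-uniform
```

With the low temperature nearly all probability mass falls on the action with the highest activation, while with the high temperature that action remains the most likely choice but is selected less than half the time.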

B. GENERIC IMPLEMENTATION DESCRIPTION 

This section provides an overview of the implementation of the conceptual model in 
discrete event simulation (DES) (A. H. Buss & Sanchez, 2002; A. Buss, 2002; A. H. Buss 
& Sanchez, 2005; A. Buss, 2009). The conceptual model was implemented by Mr. Harold 
Yamauichi of Roland’s and Associates as part of TRAC-MTRY’s research on Irregular 
Warfare. Discrete event simulation is a form of simulation that represents phenomena of 
interest from the real world through state variables, parameters, events and 
scheduling edges. It is distinguished from time-stepped models in its handling of time. In a 
time-stepped model, time advances at a fixed increment throughout the course of the 
simulation. In a DES model, time advance is controlled by a master event list, which advances 
time in uneven increments in accordance with the next scheduled event (A. Buss & Blais, 
2007; A. Buss, 2001). 
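
The next-event time advance described above can be sketched in a few lines of Python. This is a generic illustration of a master event list, not the simulation library used for the implementation. The class name, the delays, and the handler function are assumptions made for the example. Only the event names Arrival and RelevancyCheck are borrowed from the architecture described in this chapter.

```python
import heapq
import itertools


class EventList:
    """Minimal master event list: time jumps to the next scheduled event."""

    def __init__(self):
        self.time = 0.0
        self._queue = []
        self._counter = itertools.count()  # tie-breaker for simultaneous events

    def schedule(self, delay, name, handler=None):
        """Place an event on the list `delay` time units after the current time."""
        if delay < 0:
            raise ValueError("delay must be non-negative")
        entry = (self.time + delay, next(self._counter), name, handler)
        heapq.heappush(self._queue, entry)

    def run(self):
        """Process events in time order and return a log of (time, name)."""
        log = []
        while self._queue:
            event_time, _, name, handler = heapq.heappop(self._queue)
            self.time = event_time          # uneven jump, no fixed time step
            log.append((event_time, name))
            if handler is not None:
                handler(self)               # an event may schedule further events
        return log


def on_arrival(sim):
    # Scheduling edge: each Arrival schedules a RelevancyCheck (delay assumed).
    sim.schedule(0.5, "RelevancyCheck")


sim = EventList()
sim.schedule(2.0, "Arrival", on_arrival)
sim.schedule(7.25, "Arrival", on_arrival)
event_log = sim.run()
```

The simulation clock takes only the values at which events occur, so the interval between 2.5 and 7.25 is skipped entirely rather than stepped through.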

Event graphs provide a graphical representation of a discrete event simulation. In 
this form, nodes represent events and edges represent scheduling relationships between 
events. State variables change in a piecewise constant manner with the occurrence of events 
within the simulation in accordance with transition functions contained in events. The 
cognitive architecture consists of five main components at the top-level: PerceptUmpire, 
Perception, MetaCognition, ActionSelection, and LongTermMemory. 



Figure 40: Top-level view of DES cognitive architecture. 


The PerceptUmpire serves as the interface between the agent and the environment. 
This class packages information about the environment and the internal state of the agent 
into atomic percepts that are provided to the agent in accordance with the rules provided 
by the modeler. This is the class that should require the most extensive work in regard to 
integrating the cognitive architecture into other environments. The PerceptUmpire not only 
provides a means of interfacing with the environment but also ensures that the modeler can 
control the information that the agent can access. This reduces the likelihood of the agent 
gaining access to unintended or unrealistic information from the environment. Since many 
of the capabilities being analyzed in support of the future force are intended to provide 
timely information to decision makers, it is necessary to ensure that only information that a 
particular command and control system would provide is included. 

The PerceptUmpire consists of two events: ReceiveInformation and PerceptArrive. 
The ReceiveInformation event listens for events within the agent class and events within 
the environment, with supporting signatures for each event type that the agent has access 
to, and packages the information into an atomic percept. This percept is then passed to the 
agent, which treats it as an arrival event. 

The Perception component filters percepts received based on relevancy, fills 
working memory and forms a situation object composed of the atomic percepts stored in working 
memory. The Perception component treats the arrival of each percept as an arrival event. 
The Perception component contains the following events: Arrival, RelevancyCheck, 
RelevantPercept, ProcessCurrentSituation, EndProcess, and StartMetaCognition. The Arrival 
event schedules a RelevancyCheck event. The relevancy check is intended to filter 
percepts based on their age and on their salience given the agent's current goals. A percept is 
considered relevant if its time stamp is recent enough and its type is salient to the current 
goal. RelevantPercept is scheduled with a Boolean scheduling edge, and once the number 
of relevant percepts fills the available working memory a ProcessCurrentSituation event is 
scheduled. Working memory is constrained in accordance with the literature on human 
information processing. ProcessCurrentSituation creates a situation object and schedules 
EndProcess, which in turn schedules a StartMetaCognition event. These events are further 
organized into three modules: Selective attention, Working memory, and Situation 
formation. 
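
The relevancy check and the filling of working memory can be sketched in Python as follows. The Percept fields, the function names, the age threshold, and the set of salient percept types are assumptions made for illustration. As a simplification, the sketch processes a batch of percepts in a loop rather than as scheduled events, and it returns whatever has been collected if the percepts run out before working memory is full.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Percept:
    """Hypothetical atomic percept with a type, content, and time stamp."""
    kind: str
    content: str
    timestamp: float


def is_relevant(percept, now, max_age, salient_kinds):
    """A percept is relevant if it is recent enough and salient to the goal."""
    recent = (now - percept.timestamp) <= max_age
    return recent and percept.kind in salient_kinds


def form_situation(percepts, now, max_age, salient_kinds, capacity=7):
    """Fill working memory with relevant percepts, up to its capacity."""
    working_memory = []
    for percept in percepts:
        if is_relevant(percept, now, max_age, salient_kinds):
            working_memory.append(percept)
            if len(working_memory) == capacity:
                break  # working memory is full; the situation is formed
    return tuple(working_memory)


percepts = [
    Percept("threat", "gunfire", 9.0),
    Percept("weather", "rain", 9.5),            # not salient to the current goal
    Percept("threat", "old report", 1.0),       # too old to be relevant
    Percept("message", "neighbor warning", 9.8),
]
situation = form_situation(
    percepts, now=10.0, max_age=5.0,
    salient_kinds={"threat", "message"}, capacity=2,
)
```

In this example only the gunfire percept and the neighbor's warning pass the check, so the resulting situation contains two percepts.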

In MetaCognition the situation object is used to determine current relevant 
motivations, assess the agent's internal state in the given situation, select goals and methods to 
achieve them, and update long-term memory. From a high level, meta-cognition contains 
two modules. MetaCognition listens for the StartMetaCognition event in the Perception 
component. StartMetaCognition takes the situation object formed in the Perception 
component and schedules an UpdateLongTermMemory event and an UpdateMotivation event. 
The UpdateMotivation event takes the situation object and determines the change to the 
agent's perceived motivations based on the new information. The view on motivation taken 
in this work is inspired by Kenrick's work to update Maslow's hierarchy of needs (see Figure 
41), but the general framework need not be limited to this set for the general case (Kenrick, 
Griskevicius, Neuberg, & Schaller, 2010). An activation level for each motivation is 
calculated based on the agent's perception of the needs associated with each motivation. As the 
needs for a given motivation are satisfied, the activation level of that motivation is reduced. 
The UpdateMotivation event schedules a CognitiveAppraisal event. This event determines 
an overall level of satisfaction for the agent based on the current motivations. If the agent is 
completely satisfied, all of its needs are met and its motivations are equally weighted. The 
CognitiveAppraisal event schedules a FormExpectation event. The FormExpectation event 
uses the current situation to determine the expected satisfaction level in the next most likely 
situation based on its experience. The difference between current and expected satisfaction 
is used to determine whether the agent's outlook is optimistic or pessimistic. The 
CognitiveAppraisal event schedules a GoalsAndMethods event. This event determines the most 
relevant goal based on the motivations and selects a top-level method to achieve that goal from 
the set of relevant methods using RL. If the goal has not changed from the current goal, 
then the agent continues with its currently scheduled behavior. 
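
The chain from motivation update through cognitive appraisal to outlook can be sketched in Python. The text does not give the functional forms, so every formula here is an assumption made for illustration: activation is taken as one minus the perceived satisfaction of the associated need, overall satisfaction as the mean satisfaction across needs, and outlook as the sign of the difference between expected and current satisfaction. The motivation names and numeric values are likewise hypothetical.

```python
def motivation_activations(need_satisfaction):
    """Assumed form: activation falls as the associated need is satisfied."""
    return {name: 1.0 - level for name, level in need_satisfaction.items()}


def overall_satisfaction(need_satisfaction):
    """Assumed form: overall satisfaction is the mean need satisfaction."""
    return sum(need_satisfaction.values()) / len(need_satisfaction)


def outlook(current, expected):
    """Compare expected satisfaction in the next situation with the present."""
    if expected > current:
        return "optimistic"
    if expected < current:
        return "pessimistic"
    return "neutral"


# Hypothetical perceived satisfaction of three needs, each in [0, 1].
needs = {"physiological": 0.9, "self_protection": 0.25, "affiliation": 0.5}

activations = motivation_activations(needs)
most_active = max(activations, key=activations.get)  # candidate for the goal
current = overall_satisfaction(needs)
agent_outlook = outlook(current, expected=0.7)
```

Under these assumptions the least satisfied need produces the most active motivation, which would then inform the goal chosen by the GoalsAndMethods event.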

Figure 41: Kenrick’s update to Maslow’s hierarchy of needs. 


The LongTermMemory component contains the agent's history based on its 
experience in the environment as well as long-term belief networks related to its view of the 
world. Event and reward histories are stored for use during RL. The belief networks 
document the agent's world-view on certain issues and are represented in a Naive Bayes network. 

If the goal has changed, the GoalsAndMethods event in turn schedules an IdentifyBehavior event, which 
returns the behaviors associated with the chosen method. The ActionSelection component 
identifies the behaviors associated with the chosen method and determines the 
appropriate method of decision making based on the agent's experience with these behaviors. This 
component was inspired by Klein's recognition-primed decision making. Once an action 
is selected the agent schedules the action within the environment. The ActionSelection 
component listens for an IdentifyBehavior event. The IdentifyBehavior event schedules an 
IdentifyDecisionMethod event. Based on the experience of the agent in the situation, as 
reflected by the number of trials of each of the relevant actions associated with the selected 
method, either a RecognitionPrimedDecisionMaking event, an ExplorationLearning event, 
or a MentalSimulation event is scheduled. Each of these methods of decision making 
results in the scheduling of a SelectAction event, which in turn schedules an InvokeAction 
event. 
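
The branching performed by the IdentifyDecisionMethod event can be sketched as a small Python function. The function name, the trial threshold, and the action names are assumptions made for illustration. The returned strings are the event names given above. The sketch follows the conceptual description: full experience with every relevant action leads to recognition-primed decision making, partial experience to exploration, and no experience to mental simulation.

```python
def identify_decision_method(trial_counts, min_trials):
    """Choose the decision making event from per-action trial counts."""
    if not trial_counts:
        raise ValueError("at least one relevant action is required")
    if all(count >= min_trials for count in trial_counts.values()):
        return "RecognitionPrimedDecisionMaking"  # softmax, greedy setting
    if any(count > 0 for count in trial_counts.values()):
        return "ExplorationLearning"              # softmax, exploratory setting
    return "MentalSimulation"                     # project outcomes of actions


# Hypothetical trial counts for the actions of one selected method.
experienced = identify_decision_method({"trade": 12, "migrate": 10}, min_trials=10)
partial = identify_decision_method({"trade": 3, "migrate": 0}, min_trials=10)
novice = identify_decision_method({"trade": 0, "migrate": 0}, min_trials=10)
```

Whichever branch is taken, the next step in the event graph is the same SelectAction event, so the three decision methods differ only in how the candidate action is produced.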

The cognitive architecture was implemented and tested by Harold Yamauichi of 
Roland’s and Associates and targeted for use in the Cultural Geography model. In the next 
section, we review the literature on social cognition in order to understand how cognitive 
architectures relate to the representation of group cognition, a relevant application area 
where we have applied cognitive architectures to represent the decision making of a 
population segment within a conflict environment (J. Alt et al., 2009). 

C. COGNITIVE ARCHITECTURES TO REPRESENT GROUP COGNITION 

The concept of social cognition can take on varied meanings depending on the 
discipline of the researchers who are using it. In this research social cognition refers to 
the social processing and construction of information. The concept is grounded in the 
literature on psychology, with roots in the cognitive revolution, and is now largely accepted 
by contemporary social psychologists. The notion of social cognition also traces its roots 
to sociological theory, where as early as 1898, Durkheim proposed a theory of collective 
psychology, suggesting that members of social groups took part in a “collective conscience” 
and made use of a common knowledge base (Howard & Renfrow, 2006). The concept of 
prototypes, similar to Weber’s notion of “ideal types”, is supported by the use of a social 
information processing system, allowing group perceptions of other groups to influence 
their behavior in regard to the other group (Howard & Renfrow, 2006). 

Social exchange theory, which assumes individuals and groups can be motivated by 
rewards, benefits from social cognition, allowing for the consideration of beliefs and 
attitudes in decision making (Cook & Rice, 2006; Howard & Renfrow, 2006). Note that this 
also provides us a theoretical basis for the use of reinforcement learning at the group level. 
Expectation states theory attempts to explain the process by which humans and groups 
determine their roles and status in social situations, typically studied in task oriented groups 
(Correll & Ridgeway, 2006; Howard & Renfrow, 2006). In this context expectation states 
theory helps explain how expectations are formed regarding social identities within 
categorized groups based on status characteristics to eventually form representations of expected 
behavior for each group (Correll & Ridgeway, 2006). Rankings of power and prestige 
are generated based on this “collective” reward system. Again we see a linkage to the use of 
rewards to shape behavior. Critical social psychology seeks to identify the role of power 
in social processes (Howard & Renfrow, 2006). The formation of a perception of other 
groups, central to the categorizations necessary to conduct the ranking needed for social 
interactions, leads to the need for social cognition. The elements of social cognition are 
broken into two categories by Howard et al.: cognitive structures and cognitive processes 
(Howard & Renfrow, 2006). 

In this summary we see that the notion of an information processing system present 
at the group level matches with the theory from social psychology and that the notion of 
behavior shaped by rewards exists at the group level as well. 

1. Cognitive Structures 

Cognitive structures serve as a store of information in verbal form, with beliefs, 
attitudes, and values as examples of early structural forms. Social cognition proposes that 
information regarding other individuals or groups is stored in a prototype, constructed 
through a synthesis of all information regarding the other group. These prototypes are 
used to rapidly assess members of these other groups. Exemplars provide an alternative to 
prototypes, but are based on specific experiences with the other group rather than the 
average. Schemas provide yet a third alternative, allowing the application of social knowledge 
to the entire information processing system through their impact on the organization of 
knowledge. Schemas can be used to store knowledge regarding individuals or groups 
(person schemas), about the self, whether individual or group (self-schemas), regarding roles, and in relation 
to events. Social representations have been proposed as a social alternative to schemas, 
being defined as knowledge that has been generated by consensus and shared with the group, 
“commonsense theories about how the world works” (Howard & Renfrow, 2006). 

Memory studies have illustrated that memory is shaped by the social context and 
existing cognitive structures. Cohen in particular demonstrated this with experiments in 
which subjects were shown a video of a woman having dinner with her husband. Subjects 
were told the woman was either a waitress or a librarian and were asked to recall details 
regarding the video. The subjects' responses were consistent with the role they perceived her 
to be filling, with memories of the actual content being replaced by the schema associated 
with that particular role. Wegner shows that memory occurs at the group level as well as the 
individual, with groups dividing information among group members (Howard & Renfrow, 
2006). 

The impact of language on social cognition is strong, since the mechanism for the 
development and storage of group memories is communication facilitated by language. 
Group perception of information transmitted in the same language depends on the 
situation and the categorization of the sending group (Howard & Renfrow, 2006). The social 
knowledge base for a group then might contain facts in natural language that carry specific 
meaning for that group distinguished from other groups. From a production system 
viewpoint, the processing of a fact by two groups might result in the activation of completely 
orthogonal rules by the production system based on the difference in the social knowledge 
base. 

The exploration of ideologies, attitudes, beliefs, and behavior by social 
psychologists provides a starting point for the identification of social facts for a given population 
subgroup. Ideologies, values, and attitudes are distinguished in this literature by their level 
of abstraction. Attitudes are generally held toward objects, values tend to focus on notions 
such as freedom, and ideologies relate to sets of values and attitudes. Despite the difference 
in level of abstraction these concepts share the following attributes: 1) they are all 
evaluative, expressed as a positive or negative value toward an object; 2) they are all subjective, 
based on the individual's or group's perception; 3) they all can be either active or inactive 
based on the situation; 4) they are interrelated (Maio, Olson, Bernard, & Luke, 2006). 
Attitudes are more influenced through direct experience, while values and ideologies rely more 
on indirect experience, with potential sources identified as the family, media, and culture. 
Behavior has more closely been linked to attitudes than values or ideologies (Maio et al., 
2006). Viewed from an information processing perspective, then, attitudes, ideologies, and 
beliefs maintained in social memory might be used to interpret a given situation and select 
the appropriate action. Decisions regarding intergroup relations, shown to rely on social 
cognitive processes such as social categorization or stereotyping and the need for groups 
to view themselves positively, result from this collective information processing (Brubaker, 
Loveman, & Stamatov, 2004; Hogg, 2006). The central importance of ideologies, values, 
attitudes, and behavior in the development of group cognitive architectures is captured in the 
following quote regarding the September 11, 2001 attacks: 

In particular, the extremists' actions have been regarded as an inevitable 
consequence of their peculiar mix of Islam and conservative ideology, their lack of 
respect for innocent human beings, and their hatred toward the United States. 

In other words, the extremists' behavior has been partly regarded as a product 
of their ideologies, values, and attitudes. (Maio et al., 2006) 

2. Cognitive Processes 

Cognitive processes in social cognition leverage knowledge organized in the 
cognitive structures (Howard & Renfrow, 2006). The social information processing system 
mirrors the standard information processing paradigm with subtle differences. Salience 
of a stimulus is known to be task dependent in visual information processing (Anderson, 
2005). In the case of social information processing, the social meaning of percepts from 
the environment defines the social salience of the information. In a similar manner, percepts 
regarding other groups that contradict expected behavior tend to be more socially salient 
(Howard & Renfrow, 2006). 

The social cognition view of cognitive inference is geared for the use of the 
information gathered by an individual or group to make some social judgment (Howard & 
Renfrow, 2006). In a social context, we rely on prototypes of groups to make inferences 
about groups or group members we encounter. The group prototype we follow in regard 
to a given social entity also impacts the selective attention we allocate to the individual 
or group (Howard & Renfrow, 2006). During a meta-cognitive step individuals or groups 
transform the perceived information using heuristic shortcuts as implied by the 
information processing model from cognitive science. Group meta-cognition has been empirically 
studied in the context of group problem solving, with the value of group meta-cognition 
increasing with the difficulty of the problem (Iiskala, Vauras, Lehtinen, & Salonen, 2011; 
Wilson, 2001). 

Humans make decisions that tend to confirm their pre-existing beliefs, in spite of 
new evidence (Howard & Renfrow, 2006). Using limited information people tend to make 
causal inferences regarding the factors that produced a given outcome in a given situation, 
assigning characteristics to individuals or groups, associated with situations they encounter. 
Jones proposed that trait attributions resulted from the evaluation of observed situational 
behavior and prior expectations regarding other groups influenced by social norms. Several 
studies demonstrated that those viewed as being in positions of power had their behavior 
negatively categorized more frequently than others (Howard & Renfrow, 2006). 

3. Sentiment 

The impact of emotion on social information processing has been documented in 
a number of empirical studies (Jackson & Sullivan, 1989). Stets equates emotion and 
sentiment in her treatment of experiences that result from the combined influences of the 
biological, the cognitive, and the social and defines sentiment as “distinctly social in that 
individuals learn through socialization with others the names of the internal sensations they 
experience and the social norms regarding their appropriate expression” (Stets, 2006). Stets, 
referencing Thoits, cites four related elements that comprise an emotion: 1) situational 
cues; 2) physiological changes; 3) visually expressive gestures; 4) the socially defined 
label that describes the combination of the other three elements currently being experienced 
(Stets, 2006). Stets attributes a list of contemporary primary emotions to Turner: 
assertion-anger, aversion-fear, disappointment-sadness, and satisfaction-happiness (Stets, 2006). These 
contemporary views are founded on work at the individual level, identifying five common 
emotions based on facial expressions, and work to map these primary emotions to the field 
of sociology. In the sociological view, combinations of these primary emotions can lead to 
secondary emotions, or social emotions. 

Emotional intelligence and emotional competence both describe the role that 
emotion plays in information processing. Emotional intelligence is described as the ability to 
perceive and express emotion, to integrate emotion into reasoning, to understand emotion, 
and to control or manage emotion (Stets, 2006). The biological view of emotion focuses 
on emotions as expressions of internal feedback from the body, while the social structural 
approach relates sentiment to changes in power and status. The cultural perspective traces 
emotions to cultural norms, which provide information on how to feel in a given situation. 
A difference in generated emotions and culturally expected emotions describes emotional 
deviance in this paradigm. The symbolic interactionist framework relates emotion most 
closely to internal processes within individuals. Affect control theory, one variant of this 
paradigm, describes how emotions arise when internal feedback conflicts with the 
individual's perceived identity in the situation (Stets, 2006). This is not dissimilar to the cultural 
view. 


4. Group Behavior and Deviance 

Oakes suggested that group collective properties can explain behavior, supported 
by Allison and Messick who provide an account of the attribution of opinions derived from 
group decisions to individual members, even when collective decisions were not made 
(Howard & Renfrow, 2006). Power relationships also influence the attributions of group 
members and subsequent intergroup relationships. Attribution processes between groups 
are shaped by historical, economic, and political contexts of intergroup relations, which is 
consistent with Fisher’s narrative paradigm. This favoring of closely related groups results 
in a greater allocation of selective attention to those groups categorized as most like the 
observing group, while relying on more basic schemas for those groups that are not as 
relevant. One result of this difference in the level of detail is illustrated by the extreme 
evaluations we tend to place on those groups with which we interact and attend to less 
frequently (Howard & Renfrow, 2006). Visual traits often are used for categorization, 
and when these traits occur in conjunction with a difference in resources the group with 
more resources will usually be assigned more status. This categorization process leads 
to the formation of group identities, which tend to be more relevant for those that are 
disadvantaged than for high status groups, consistent with social identity theory (Howard & 
Renfrow, 2006). This is consistent with work on deviance, which uses the shared normative 
expectations of groups as the mechanism by which groups evaluate behaviors and world views 
of individuals or other groups (Kaplan, 2006). Those that deviate from established norms 
are described as doing so from motivated deviance, usually due to a lack of motivation to 
adhere to social norms, or unmotivated deviance, usually in spite of the individual's efforts 
to adhere (Kaplan, 2006). Groups define social norms, which eventually might become 
legitimatized as laws of a country established by the dominant group. Motivations and 
goals play an important role in the adherence of members of a group to established norms. 

The use of cognitive architectures to represent group cognition within social 
simulation has not been explored extensively in the literature, but the use of this type of 
information processing framework at the group level seems reasonable given the social psychology 
literature. The following section will describe the empirical performance of the cognitive 
architecture previously described when embedded within the CG model. 

D. APPLICATION OF REINFORCEMENT LEARNING AND A PRACTICAL 
COGNITIVE ARCHITECTURE WITHIN THE CULTURAL GEOGRAPHY 
MODEL 

This section describes the application of RL to the Cultural Geography model, beginning
in 2010 with the incorporation of RL methods into the representation of the Theory
of Planned Behavior to drive action selection (S. Papadopoulos et al., 2013). Subsequently,
this research also developed an RL-based representation of trust that was incorporated into
the agent-level decision regarding the selection of targets for communications (Pollock et
al., 2011). A methodology to develop scenarios from existing data was also provided and
demonstrated using open source data sources (McKaughan, 2011). Finally, a proof-of-principle
cognitive architecture, based on the conceptual model described and incorporating RL, was
developed and implemented (J. K. Alt et al., 2011).

1. Initial Application of Reinforcement Learning within CG 

Initially, RL was used to control the behavior of threat agents, to more clearly understand
how RL might be applied within a complex social simulation. The scenario used
represents an area of Kandahar province in Afghanistan, with a civilian population of 350
agents and insurgent, host nation, and stabilizing forces, all of which communicate within
a dynamic social network. The population agents take actions to meet basic needs. Threat
forces attempt to reduce the satisfaction of the population on a set of issues related to
stability, while host nation and stabilizing forces seek to improve the population's
satisfaction on this issue set. Figure 48 shows the methodology used to develop the scenario
and conduct analysis, while Figure 49 shows a functional decomposition of a generic
population agent.

Figure 42: Analysis methodology for closed-form use case.

Figure 43: Population agent functional decomposition.


The initial implementation of RL within the CG model focused on enabling the
threat agents to use RL to control their action selection. There is one insurgent agent per
region, each with four action choices:

• DoNothing: The agent performs no action.

• KillCivilServant: The agent makes an assassination attempt against a civil servant.

• IED: The agent plants an improvised explosive device (IED) against any target.

• IED_ANSF: The agent plants an IED targeting the Afghan National Security Forces
(ANSF).

The problem is similar in nature to a noisy bandit problem, with a reward function
providing the insurgents feedback following each action opportunity in the presence of a
variety of competing actions selected by the host nation and stabilizing forces. In this
simple case two agents are compared: an RL agent, Tal1, and a standard agent, Tal2. The
distribution of actions selected by each agent is shown in Figure 44. KillCivilServant
provides the best reward to the threat agent, and we see that the RL agent does select this
action more often than the standard agent.

Figure 44: Distribution of threat actions.


This simple case illustrated that the RL agent responded to the dynamics of the 
environment rather than performing its actions according to a predetermined script. This 
approach served to reduce scenario and data development time and provide more realistic 
results. Building on this initial use of RL to guide the behaviors of insurgent actors,
the next section describes how the research was extended to apply reinforcement
learning to the representation of the theory of
planned behavior.
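As a rough illustration of the noisy-bandit framing used for the threat agents, the following Python sketch runs an epsilon-greedy learner over the four action choices. The mean rewards, the noise level, the exploration rate, and the sample-average value update are all assumptions made for this example; the CG model's actual reward function and learning algorithm are not reproduced here.

```python
import random

ACTIONS = ["DoNothing", "KillCivilServant", "IED", "IED_ANSF"]

# Hypothetical mean rewards, chosen only so that KillCivilServant is best.
ASSUMED_MEANS = {"DoNothing": 0.0, "KillCivilServant": 1.0, "IED": 0.5, "IED_ANSF": 0.3}


def run_bandit(mean_reward, epsilon=0.1, trials=5000, noise=0.5, seed=0):
    """Run an epsilon-greedy learner on a noisy bandit and return selection counts."""
    rng = random.Random(seed)
    q = {a: 0.0 for a in ACTIONS}   # action-value estimates
    n = {a: 0 for a in ACTIONS}     # number of times each action was selected
    for _ in range(trials):
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)                 # explore uniformly
        else:
            action = max(ACTIONS, key=lambda a: q[a])    # exploit current estimates
        reward = rng.gauss(mean_reward[action], noise)   # noisy feedback
        n[action] += 1
        q[action] += (reward - q[action]) / n[action]    # incremental sample mean
    return n


counts = run_bandit(ASSUMED_MEANS)
for name in ACTIONS:
    print(name, counts[name])
```

Under these assumed rewards the learner concentrates its selections on KillCivilServant, which mirrors the qualitative pattern described for the RL agent.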

2. Representing Theory of Planned Behavior 

Previous iterations of the CG model implemented the theory of planned behavior
through the use of Bayesian belief networks. This approach, which resulted in a separate
belief network for each agent, required extensive subject matter expert input and
greatly increased runtime during execution. In order to reduce the data requirements and to
link the agent responses more closely with the dynamics of the model, an implementation
of the theory of planned behavior that relied on RL was developed and incorporated into
the model. This resulted in reduced runtime, a reduced data development process, and
model output that responded to the dynamics of the model. The research discussed here
has been accepted for journal publication (S. Papadopoulos et al., 2013).

The Theory of Planned Behavior (TPB) provides an empirically grounded conceptual
model of the formation of human behavioral intention (Ajzen, 1991). The theory of
planned behavior states that behavioral intentions are formed by a combination of input
from perceived subjective norms toward the behavior, internal attitude toward the behavior,
and perceived behavioral control regarding the behavior. Subjective norm (SN) refers to
the opinions of those in the individual's peer group regarding the behavior. The components
of SN are the normative belief strength, $nb$, and the motivation, $m$, to comply with the
normative belief, summed across the relevant members of the individual's social network, $n$.

$$SN = \sum_{i}^{n} nb_i \, m_i \qquad (42)$$

Attitude (A) refers to the individual's own beliefs regarding the behavior in question.
The attitude is expressed as the product of belief, $b$, and the outcome evaluation, $e$, an
evaluation of the value of the potential outcome (Ajzen, 1991).

$$A = \sum_{i}^{n} b_i \, e_i \qquad (43)$$

Perceived behavioral control (PBC) describes the individual's perception of the likelihood
that they will succeed in the execution of the behavior given that they attempt it.
Control beliefs, $cb$, serve as the likelihood estimate, while perceived facilitation, $pf$,
provides the value estimate (Ajzen, 1991).

$$PBC = \sum_{i}^{n} cb_i \, pf_i \qquad (44)$$

The model, as formulated by Ajzen (1991), is an expected value model, with
a likelihood and payoff derived for each component. The linear sum of these components
provides a behavioral intention (BI) score for that individual for the behavior in question.

BI = A + SN + PBC (45) 

In order to adapt the theory of planned behavior for use in the CG model we implement
the theory of planned behavior using a reinforcement learning based approach.
Perceived behavioral control (PBC) can be thought of as the discounted reward history for
the actions associated with the behavior of interest.

$$PBC_a = \sum_{i} \lambda^{t_i - t} \, u_i \, H(t_i > t), \quad \forall a \in A \qquad (46)$$

The subjective norm (SN) is represented as the average activation level of the $k$
nearest neighbors within the social network.

$$SN_a = \frac{\sum_{i=1}^{k} PBC_{a,i}}{k}, \quad \forall a \in A \qquad (47)$$

The attitude (A) toward the behavior is likely the most difficult to generalize, since 
social simulations represent this in a variety of ways if at all. For the purposes of this 
example, it can be thought of as the utility associated with the behavior, B, at time t, the 
current simulation time. 


$$A_a = U_{B_t}, \quad \forall a \in A \qquad (48)$$

In the CG model, the attitude toward a given behavior is drawn from a set of belief
and issue stances or strengths that are updated dynamically over the course of a simulation
run. The formulation shown below is then equivalent to a multi-attribute utility function
that considers the agent's internal state, A, the opinion of others, SN, and the external reward
or success history, PBC, associated with a behavioral action selection.

$$\beta_a = w_1 A_a + w_2 SN_a + w_3 PBC_a, \quad \forall a \in A \qquad (49)$$

$$\text{s.t.} \quad \sum_{i=1}^{3} w_i = 1.0 \qquad (50)$$

This formulation was used in support of current operations analysis of the Kandahar
province of Afghanistan and is the subject of an upcoming journal article (S. Papadopoulos
et al., 2013). Results of a simple case study verify that the RL code behaves as
predicted. The case study involved a small experiment that controlled the level
of water at each of the servers in a province. As the level of water was reduced in one
area, rather than simply continuing to execute the same action despite a negative outcome,
as in previous versions of the model, the agents adapted their behavior and tried alternatives,
eventually identifying which servers still had water. While this may sound simple, the use
of agents that actually respond to the dynamics of the environment greatly increased the
face validity of the model output. The use of RL also reduced the scenario construction
time and data requirements while reducing runtime over previous model versions.
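The weighted combination of attitude, subjective norm, and perceived behavioral control can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions rather than the CG model's implementation: the function names, the decay value, and the choice to discount each past reward by `decay ** (now - t_i)` are hypothetical and were picked only to make the arithmetic concrete.

```python
def perceived_behavioral_control(reward_history, now, decay=0.9):
    """Discounted sum of past rewards (assumed form).

    reward_history is a list of (t_i, u_i) pairs; only rewards observed at or
    before `now` contribute, each weighted by decay ** (now - t_i).
    """
    return sum(decay ** (now - t_i) * u_i
               for t_i, u_i in reward_history if t_i <= now)


def subjective_norm(neighbor_pbc):
    """Average PBC value over the k nearest neighbours in the social network."""
    return sum(neighbor_pbc) / len(neighbor_pbc) if neighbor_pbc else 0.0


def behavioral_intention(attitude, sn, pbc, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of the three components; the weights must sum to 1.0."""
    w1, w2, w3 = weights
    if abs(w1 + w2 + w3 - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    return w1 * attitude + w2 * sn + w3 * pbc


# Illustrative values only.
pbc = perceived_behavioral_control([(0, 1.0), (2, -1.0), (5, 10.0)], now=3, decay=0.5)
sn = subjective_norm([0.2, 0.4])
score = behavioral_intention(0.5, sn, pbc, weights=(0.5, 0.25, 0.25))
print(pbc, sn, score)
```

An agent would compute one such score per candidate action and feed the scores to its action-selection policy; how that selection is done in the CG model is not shown here.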

3. Incorporation of a Cognitive Architecture 

The cognitive architecture, implemented by Rolands and Associates, was incorporated
into the CG model and used in support of the FY11 Irregular Warfare Tactical
Wargame (IW TWG). During the war-game the CG model is used in a human-in-the-loop
fashion, with human players making action selections which become inputs to the modeling
suite. The simulation runs for a week's worth of game time and results from a single
run are provided back to the player. The action space for the population agents in this use
case is very small and the state space in this implementation was ill-defined. The code
and the cognitive architecture functioned as expected, controlling the level of information
provided to each agent through the percept mediator and allowing the agents to form a
notion of state and map states to actions. Agents that use the cognitive architecture would
also be well-suited for use as autonomous red or blue forces in a closed-loop setting or as
semi-autonomous red or blue forces in a war game. Should the population agents be
provided a richer action space, they might warrant the use of more sophisticated autonomous
behavior. Note that the scenario configuration and definition of the reward signal must
be accomplished for each application of the CG model. The implementation served as the
centerpiece of the FY11 IW TWG, facilitating the player stimulus regarding population
output which drove the conduct of the war-game. This effort was recognized with the Army
Modeling and Simulation Office Team Award in 2011.

Rigorous testing of the integrated code is ongoing at TRAC-MTRY, and initial results
verified that the code is functioning properly and that the scenario file was not properly
configured in a number of respects in the previous iteration. The use of cognitive
architectures to represent goal-driven behavior within simulation and the incorporation of
reinforcement learning to prevent overly brittle solutions continue to be an active area of
research.

E. CONCLUSIONS 

In conclusion, this chapter documented the development of a practical cognitive
architecture and the use of RL in the representation of human population agents within an
agent-based social simulation. RL was first used to make decisions for threat forces in a
closed-loop setting and later used to represent the theory of planned behavior to drive
decision making for population agents in a closed-loop setting. These innovations were used
to support an Irregular Warfare Tactical Wargame (IW TWG) in 2010 and an analysis of
the population along the Pakistan-Afghanistan border. Finally, a version of the cognitive
architecture was incorporated into the CG model and used in support of IW TWG 2011,
which received an Army Modeling and Simulation award for excellence in analysis. Scenario
configuration proved to be challenging, primarily due to scenario initialization files,
limited data availability, and the use case. While CG was developed for closed-form use,
its use case migrated to a human-in-the-loop war-game, but the design focus remained
on the closed-form case rather than the stimulation of human players within a game-based
setting. While the models, including the cognitive architecture, functioned as intended, the
signal provided to the players was noisy due to lack of calibration. This issue was identified
in after action reviews, and strategies to correct this have been implemented by the team
along with a refocus of model development to a human-in-the-loop use case.

The use of RL techniques in this area in conjunction with a practical cognitive 
architecture or other construct to maintain domain knowledge provides a conceptual model 
for human behavior that has the potential to be validated and that represents the essential 
elements of human information processing.

VI. CONCLUSION


This chapter provides a summary of the contributions of this dissertation and directions
for future research in this area.

A. SUMMARY OF CONTRIBUTIONS 

This research provided insights into the application of model-free reinforcement 
learning algorithms within Department of Defense simulation models. The application of 
these algorithms to several use cases was described: 

• Human behavior representation. 

• Assignment and scheduling of resources. 

• Validation of reward structures within training simulations. 

The research developed a novel reinforcement learning algorithm, Direct-Q Computation,
designed to speed reinforcement learning by directly addressing the temporal
credit assignment problem. This algorithm leverages properties
of maximum likelihood estimation to develop an estimator that minimizes bias, resulting in
faster learning in sequential decision making tasks with delayed reward when compared
with other model-free approaches.

1. Direct-Q Computation 

The primary contribution of this research is the novel use of the exponentially
weighted average reward as an action-value estimator in reinforcement learning systems
in order to address the temporal credit assignment problem. This
relatively straightforward approach improves learning speed over dominant existing
approaches in task environments with noisy and delayed reward signals and improves
performance in non-stationary environments, while reducing the number of parameters required
to be specified by the system designer from three to two. Many real-world applications fall into
this category of problem, and in these cases delays in learning or recovery can result in
control system failures and lost resources. The results provide the modeling and simulation
community with a method that speeds learning in these challenging cases, reducing
the time required to train autonomous software agents and the time required for agents to
adjust to changes in their environment. These performance results carry over into each of
the multiple modeling and simulation application areas examined in this research.
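To make the idea of an exponentially weighted average reward concrete, the Python sketch below contrasts it with a plain sample average on a reward stream whose level changes halfway through. It is a generic illustration of the two estimators and not the Direct-Q Computation algorithm itself; the step size `alpha` and the reward sequence are assumptions chosen for the example.

```python
def sample_average(rewards):
    """Incremental sample mean: every past reward carries equal weight."""
    q, n = 0.0, 0
    for r in rewards:
        n += 1
        q += (r - q) / n
    return q


def exponentially_weighted_average(rewards, alpha=0.1):
    """Constant step-size update: recent rewards carry exponentially more weight."""
    q = 0.0
    for r in rewards:
        q += alpha * (r - q)   # same as (1 - alpha) * q + alpha * r
    return q


# Assumed non-stationary stream: reward drops from 1.0 to 0.0 halfway through.
rewards = [1.0] * 100 + [0.0] * 100

slow = sample_average(rewards)
fast = exponentially_weighted_average(rewards)
print(slow, fast)
```

After the change the sample average still sits midway between the old and new reward levels, while the exponentially weighted estimate has moved almost entirely to the new level, which is the kind of faster adjustment the text attributes to this family of estimators.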

2. Enabling Adaptive Behavior in a Combat Simulation 

A fourth major contribution of this research is a methodology and application of
reinforcement learning to represent human decision-making within a combat simulation. This
straightforward approach provides an empirically developed conceptual model of human
decision making, important for eventual model validation, that facilitates dynamic decision
making and allows agents to learn from interaction with their environment. This approach
incorporates the novel use of reinforcement learning within hierarchical task networks,
providing the potential to enable adaptive decision making within complex behaviors. This has
particular relevance for enabling agents that adapt to the behavior of an opposing force, as
human decision makers do, as opposed to the relatively brittle scripted methods currently in
use.


3. Maximizing the Value of a UAV Schedule from a DES 

RL algorithms provide an alternative approach to dynamic programming techniques
currently employed for scheduling and assignment of resources in TRAC’s assignment and
scheduling tool. This simulation is used to produce a feasible schedule for unmanned aerial
assets that maximizes a value function by correctly pairing platforms with mission
demands in the context of a combat scenario. The current approach employs a linear program
that maximizes value over a finite-time horizon, but fails to provide a feasible schedule
that maximizes value in cases where high value targets appear beyond the time horizon or
where high value emergent targets become available following the initial allocation.
Further, this approach fails to learn to recognize the cues leading to these situations, as a human
decision maker would over time, and makes these mistakes consistently, resulting in
feasible but non-value-maximizing schedules. The value of the feasible schedule produced
in different combat simulations for a given mix of unmanned platforms is used to inform
acquisition decisions regarding unmanned assets, so the current tool's limitations directly
impact the representation of the value of a given mix to senior decision makers. Since
the value lost to these cases varies across mixes, the analyst cannot know
how this systematic issue affects results in the aggregate. This research demonstrates the
use of reinforcement learning to address these cases and an approach that relaxes the strict
requirement for fully observable demands currently imposed on the simulation.

4. Verification of Reward Structure in Training Simulation 

A major contribution of this research is a novel methodology and example application
of the use of reinforcement learning as a means of verifying the reward structure of a
training simulation. The reward structure in a training simulation directly impacts trainee
learning time and outcomes. A weak reward signal will result in slower learning, and a
reward signal that rewards trainee actions that are not consistent with learning objectives will
result in the trainee learning the wrong objectives. This research demonstrates the
use of reinforcement learning to examine the reward structure and produce an example of
the learned behavior, or policy, that can provide the training simulation designer feedback
on the student behaviors rewarded by the training simulation prior to the simulation ever
touching student hands, allowing the developer the opportunity to identify and correct
deficiencies prior to fielding. This research contributes a repeatable methodology for the use
of RL in this use case.

5. Practical Cognitive Architecture 

RL algorithms used in conjunction with cognitive architectures provide a traceable
means of enabling autonomous behavior while representing human decision making in DoD
simulation models. The use of these methods within social simulations, as suggested by the
NRC and others, provided increased transparency when implemented within the CG model
and significantly reduced data development and improved run-time over previous methods,
while producing similar results. A fifth contribution of this research is the development
and application of a novel practical cognitive architecture that facilitates the representation
of human information processing and the inclusion of domain knowledge in a structured
manner that enables the selective use of goal-driven reinforcement learning to represent
human decision making. The cognitive architecture provides an understandable framework
to incorporate the effects of perception, working memory, and dynamic goal-setting within
simulation agents. This is particularly relevant for analysis topics related to the value of
information or the impact of networked sensors. The cognitive architecture also has
relevance to the representation of civilian behavior in conflict areas, where the analysis focuses
on the beliefs and interests of a population and the cognitive architecture provides an
organizing construct. This contribution was incorporated into a social simulation used to
facilitate war-games that received a 2011 Army Modeling and Simulation Office award for
excellence in analysis.

B. FUTURE RESEARCH 

Future research could extend the results of this dissertation by incorporating techniques
to dynamically adjust the ratio between exploration and exploitation and by integrating
the cognitive architecture within other simulation models, such as COMBATXXI. Balancing
exploration and exploitation is critical to the performance of reinforcement learning
systems and is a problem encountered across multiple disciplines. Future work in this area
is vitally important to both improving agent performance and to understanding how humans
accomplish this challenging task.

Perception of state is essential to decision-making and learning, but the state
representation must remain as sparse as possible to speed learning in reinforcement learning
systems. Developing methodologies to identify the key components of the environmental
state in different decision situations is key to understanding how decision-makers reason in
uncertain environments and the information required to represent at the agent-level in order
to facilitate more realistic representations of human behavior in DoD simulation models.

The application of reinforcement learning methods to facilitate sequential design of 
experiments is an interesting application area not explored in this research, but of interest 
to the modeling and simulation community. In this research we discuss the application of 
reinforcement learning to one aspect of training simulation, but a separate application area 
could involve the use of reinforcement learning agents to drive the pace of instruction based 
on feedback from the trainee, serving as intelligent tutors. 

In this research we have sought to demonstrate the broad applicability of reinforcement
learning within DoD models and simulations, but there is still much work to be done
to realize the full potential of these simple yet powerful techniques within defense
modeling and simulation applications.

APPENDIX A: FIRST-VISIT AND EVERY-VISIT MONTE-CARLO


Algorithm 11 First-visit MC (Szepesvari, 2010)

1: FirstVisitMC($\tau$, $V$, $n$)
2: Parameters: $\tau = (s_0, r_1, \ldots, r_T, s_T)$ is a trajectory with $s_T$ being an absorbing state, $n$
is the number of times the long-term value estimate, $V$, was previously updated, $\gamma$ is
the discount rate, $\alpha$ is the learning rate, and $t$ is the current trial.
3: $sum \leftarrow 0$
4: for $t = 0$ to $T - 1$ do
5:     $sum \leftarrow sum + \gamma^{t} r_{t+1}$
6: end for
7: $V \leftarrow V + \frac{1}{n}(sum - V)$
8: return $V$
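The listing above can be sketched in Python as follows. This is an illustrative sketch and not code from the dissertation: the function name and argument layout are hypothetical, and it assumes that `n` counts the current update as well (so `n >= 1`), which avoids a division by zero on the first update.

```python
def first_visit_mc(rewards, value, n, gamma=0.9):
    """Update a long-term value estimate from one complete trajectory.

    rewards holds r_1, ..., r_T in order; value is the current estimate V;
    n is the update count including this update (assumed n >= 1).
    """
    # Discounted return of the trajectory: sum over t of gamma^t * r_{t+1}.
    discounted_return = sum(gamma ** t * r for t, r in enumerate(rewards))
    # Running-mean step toward the observed return.
    return value + (discounted_return - value) / n


v = first_visit_mc([1.0, 0.0, 2.0], value=0.0, n=1, gamma=0.5)
v = first_visit_mc([0.5], value=v, n=2, gamma=0.5)
print(v)
```

With a step of `1 / n`, the estimate is simply the mean of the discounted returns observed so far.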


Algorithm 12 Every-visit MC (Szepesvari, 2010)

1: EveryVisitMC($s_0, r_1, s_1, r_2, \ldots, s_{T-1}, r_T, V$)
2: Parameters: $s_t$ is the state at turn $t$, $r_{t+1}$ is the reward associated with the $t^{th}$ transition,
$T$ is the length of the episode, $V$ is the array storing the current value function estimate,
$\gamma$ is the discount rate, $\alpha$ is the learning rate.
3: $sum \leftarrow 0$
4: for $t \leftarrow T - 1$ down to $0$ do
5:     $sum \leftarrow r_{t+1} + \gamma \, sum$
6:     $\delta(s_t) \leftarrow sum$
7:     $V(s_t) \leftarrow V(s_t) + \alpha(\delta(s_t) - V(s_t))$
8: end for
9: return $V$
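A compact Python rendering of the every-visit update is sketched below. The function name, the use of a dictionary for the value table, and the default of 0.0 for unseen states are assumptions made for illustration.

```python
def every_visit_mc(states, rewards, values, gamma=0.9, alpha=0.1):
    """Every-visit Monte Carlo update over one episode.

    states holds s_0, ..., s_{T-1}; rewards holds r_1, ..., r_T, so rewards[t]
    is the reward for the transition out of states[t]. values is a dict
    mapping state to estimate and is updated in place.
    """
    discounted_return = 0.0
    # Walk the episode backwards so each return is built from the one after it.
    for t in range(len(states) - 1, -1, -1):
        discounted_return = rewards[t] + gamma * discounted_return
        s = states[t]
        current = values.get(s, 0.0)
        values[s] = current + alpha * (discounted_return - current)
    return values


table = every_visit_mc(["A", "B"], [0.0, 1.0], {}, gamma=0.5, alpha=0.5)
print(table)
```

Because the loop runs backwards, each state's target return is available in a single pass without storing the whole sequence of returns.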
Algorithm 13 TD($\lambda$) (Szepesvari, 2010; Sutton & Barto, 1998)

1: TD($\lambda$)
2: Parameters: $s$, the current state; $V(s)$, current value estimate of current state; $e(s)$,
current eligibility to receive credit of current state; $\gamma$, discount rate; $\alpha$, learning rate; $\lambda$,
decay rate.
3: Initialize $\pi$ (e.g., $\epsilon$-greedy, Boltzmann).
4: Initialize $V(s)$ arbitrarily, $e(s) = 0$ for all $s$.
5: Select $a$ using $\pi(s)$.
6: Take action $a$, observe reward $r$ and next state $s'$.
7: $\delta \leftarrow r + \gamma V(s') - V(s)$
8: $e(s) \leftarrow e(s) + 1$
9: for all $s$ do
10:     $V(s) \leftarrow V(s) + \alpha \delta e(s)$
11:     $e(s) \leftarrow \gamma \lambda e(s)$
12: end for
13: $s \leftarrow s'$
14: Increment $t$. If $t < T$, go to line 5.
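The eligibility-trace update in this listing can be sketched in Python as shown below. To keep the example self-contained it evaluates a fixed list of transitions rather than selecting actions with a policy; that simplification, the function name, and the parameter defaults are assumptions made for illustration.

```python
from collections import defaultdict


def td_lambda_episode(transitions, values=None, gamma=0.9, alpha=0.1, lam=0.8):
    """Apply TD(lambda) with accumulating traces to one episode.

    transitions is a list of (state, reward, next_state) tuples generated by
    some policy. Returns the value table after processing the episode.
    """
    values = defaultdict(float) if values is None else values
    trace = defaultdict(float)                  # eligibility e(s), initially 0
    for s, r, s_next in transitions:
        delta = r + gamma * values[s_next] - values[s]   # temporal-difference error
        trace[s] += 1.0                                  # mark current state as eligible
        for state in list(trace):
            values[state] += alpha * delta * trace[state]
            trace[state] *= gamma * lam                  # decay eligibility
    return values


episode = [("A", 0.0, "B"), ("B", 1.0, "END")]
v_full = td_lambda_episode(episode, gamma=1.0, alpha=0.5, lam=1.0)
v_zero = td_lambda_episode(episode, gamma=1.0, alpha=0.5, lam=0.0)
print(dict(v_full), dict(v_zero))
```

With `lam=1.0` the reward at the end of the episode is credited back to state A as well as B, while with `lam=0.0` only the most recent state receives credit, which shows how the decay rate controls temporal credit assignment.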
APPENDIX B: COMPARISON OF ε-GREEDY AND BOLTZMANN
EXPLORATION


It is useful to provide a more detailed explanation of the $\epsilon$-greedy and Boltzmann
exploration policies in order to understand the true nature of their impact on RL performance when
coupled with an action-value estimator. The choice of exploration policy is a significant
factor in the performance of the overall learning system by all metrics, which is intuitive
since the exploration policy essentially controls the sampling strategy of the learning
system. Assume two independent learning agents are operating in an MDP as defined
previously. For simplicity let it be a single-state MDP, $S = \{s_0\}$, with a finite action set
consisting of only two actions, $A = \{a_0, a_1\}$, and assume that both agents are using the same
action-value estimation algorithm and are provided with exactly the same estimates, $Q(s, a)$. Assume
one agent employs an epsilon-greedy policy, with fixed $\epsilon$, and the other agent employs a
Boltzmann exploration policy, with fixed $\tau$. Let the true value of each action be $q_0$ and $q_1$
respectively and let the true value plus estimation error be,

$$Q(s_0, a_0) = q_0 + e_0$$
$$Q(s_0, a_1) = q_1 + e_1 \qquad (51)$$

for each action choice, with $q_0 > q_1$. To simplify notation for this single-state case we
will refer to $Q(s_0, a_0)$ and $Q(s_0, a_1)$ as $Q_0$ and $Q_1$.

Case 1. Let the estimation errors associated with the two estimates be equal, $e_0 = e_1$,
and of the same sign. In this case, since the estimation errors are equal, we see that $Q_0 > Q_1$.
Then epsilon-greedy will choose the greedy action $a_0$ with probability $1 - \epsilon$ and the
exploratory action $a_1$ with probability $\epsilon$. So the expected number of exploratory actions
is $\epsilon T$, where $T$ is the number of trials in the episode. In the case of the Boltzmann
exploration policy the probability of selecting each action is given by,

$$P(\pi(s_0) = a_0) = \frac{\exp\frac{Q_0}{\tau}}{\sum_a \exp\frac{Q_a}{\tau}} = \frac{\exp\frac{q_0 + e_0}{\tau}}{\sum_a \exp\frac{q_a + e_a}{\tau}}$$

$$P(\pi(s_0) = a_1) = \frac{\exp\frac{Q_1}{\tau}}{\sum_a \exp\frac{Q_a}{\tau}} = \frac{\exp\frac{q_1 + e_1}{\tau}}{\sum_a \exp\frac{q_a + e_a}{\tau}} \qquad (52)$$


Since we know that $q_0 > q_1$ and that $e_0 = e_1$, we see that $\frac{q_0 + e_0}{\tau} > \frac{q_1 + e_1}{\tau}$ for fixed $\tau > 0$,
but this alone does not tell us how close $P(\pi(s_0) = a_0)$ and $P(\pi(s_0) = a_1)$ are, since we have
not specified enough information regarding $\tau$. Dropping the error terms since they are equal, we know
that the probability of selecting either action will be equal when,


$$\exp\frac{q_0}{\tau} = \exp\frac{q_1}{\tau} \qquad (53)$$

$$\frac{q_0}{\tau} = \frac{q_1}{\tau}$$

$$\frac{q_0 - q_1}{\tau} = 0 \qquad (54)$$

So we see that, since the numerator in this case is a positive constant (because $q_0 > q_1$), the
only way to satisfy the condition for equal probabilities is for the denominator to go toward
infinity.

$$\lim_{\tau \to \infty} \frac{q_0 - q_1}{\tau} = 0 \qquad (55)$$

The size of $\tau$ required for this condition to hold for our case depends on the magnitude of
the interval $q_0 - q_1$. Recall that for the exponential function,


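$$\lim_{x \to 0} \exp x = 1 \qquad (56)$$

and that for our case we see that for sufficiently large $\tau$,

$$P(\pi(s_0) = a_0) \approx P(\pi(s_0) = a_1) \qquad (57)$$

and the probability of taking an exploratory action in general is,

$$P(\pi(s_0) = a_1) = \frac{\exp\frac{q_1}{\tau}}{\sum_a \exp\frac{q_a}{\tau}} \qquad (58)$$

and that the probability of taking an exploratory action with the Boltzmann policy and the
epsilon-greedy policy will be the same when,

$$\epsilon = \frac{\exp\frac{q_1}{\tau}}{\exp\frac{q_0}{\tau} + \exp\frac{q_1}{\tau}} \qquad (59)$$

which we can restate as,

$$\epsilon = \frac{1}{1 + \exp\frac{q_0 - q_1}{\tau}} \qquad (60)$$

let $q_0 - q_1 = \delta q$, and we can say,

$$\epsilon = \frac{1}{1 + \exp\frac{\delta q}{\tau}}$$

$$\epsilon \left(1 + \exp\frac{\delta q}{\tau}\right) = 1$$

$$\epsilon \exp\frac{\delta q}{\tau} = 1 - \epsilon$$

$$\tau = \frac{\delta q}{\ln(1 - \epsilon) - \ln(\epsilon)} \qquad (61)$$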


This defines the temperature of the Boltzmann strategy in terms of epsilon and the
difference between the action values for the case of equal errors in the estimate, both in the same
direction. Note that this definition fails for the case where $1 - \epsilon = \epsilon$, and that $\epsilon \in (0, 1)$.
As $\epsilon \to 1$, the denominator goes toward negative infinity and as $\epsilon \to 0$, the denominator
grows toward positive infinity, and in fact the function is antisymmetric about the point
$\epsilon = 0.5$. With fixed numerator this yields an antisymmetric function for the temperature,
producing positive temperature parameter values for low exploration probabilities and negative
values when the probability of exploring is greater than 0.5. Setting the temperature
parameter using this formulation would ensure that the probability of taking an exploratory action
using a Boltzmann policy was equal to the probability of taking an exploratory action using
the epsilon-greedy policy, and the expected number of exploratory actions using a Boltzmann
strategy with its temperature parameter set in accordance with the formula above
would equal $\epsilon T$, making it equivalent to the epsilon-greedy case. The insight this provides
us is that the difference between the action-value estimates is important for understanding
the behavior of the Boltzmann policy as it is typically employed, as is the importance to the
Boltzmann policy of being coupled with value estimation algorithms that minimize error
or at least treat it in a systematic manner. We see from,

$$\frac{1}{1 + \exp\frac{\delta q}{\tau}} \qquad (62)$$

that as $\delta q \to \infty$ the probability of choosing the exploratory action goes toward zero
given a fixed temperature parameter. This demonstrates the importance of the accuracy
of the value estimator in the behavior of Boltzmann exploration, and we see that when the
difference between the value of the best action and the next best action is small, the
probability of choosing an exploratory action will be relatively higher.
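The relationship between the temperature and epsilon derived for Case 1 can be checked numerically. The following Python sketch is only a verification aid under the two-action, equal-error assumptions used above; the function names are hypothetical.

```python
import math


def boltzmann_explore_prob(q0, q1, tau):
    """Probability that a two-action Boltzmann policy selects a_1 (the lower-valued action)."""
    return 1.0 / (1.0 + math.exp((q0 - q1) / tau))


def matching_temperature(delta_q, epsilon):
    """Temperature at which the Boltzmann exploration probability equals epsilon."""
    if not 0.0 < epsilon < 1.0 or epsilon == 0.5:
        raise ValueError("epsilon must lie in (0, 1) and differ from 0.5")
    return delta_q / (math.log(1.0 - epsilon) - math.log(epsilon))


tau_low = matching_temperature(delta_q=1.0, epsilon=0.1)    # positive temperature
tau_high = matching_temperature(delta_q=1.0, epsilon=0.9)   # negative temperature
print(tau_low, boltzmann_explore_prob(1.0, 0.0, tau_low))
print(tau_high, boltzmann_explore_prob(1.0, 0.0, tau_high))
```

Plugging the computed temperature back into the Boltzmann probability recovers the chosen epsilon, and holding the temperature fixed while widening the gap between the action values drives the exploration probability toward zero.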

Case 2. Let the estimation errors be unequal such that $e_0 < e_1$. Recall that
$q_0 > q_1$ shows the relationship between the true values of the actions for our single-state
MDP and that the value of the state-action pair as perceived by the agent is,

$$Q(s_0, a_0) = q_0 + e_0 \qquad (63)$$

$$Q(s_0, a_1) = q_1 + e_1 \qquad (64)$$

We begin again with the epsilon-greedy case, which will choose the action with the maximum
perceived value with probability $1 - \epsilon$. In order for the epsilon-greedy approach to choose
the correct action, the values must satisfy,

$$q_0 + e_0 > q_1 + e_1$$

$$q_0 - q_1 > e_1 - e_0 \qquad (65)$$

$$\frac{q_0 - q_1}{e_1 - e_0} > 1 \qquad (66)$$

$$\frac{\delta q}{e_1 - e_0} > 1 \qquad (67)$$


so that the difference between the true values is greater than the difference between the
errors associated with the estimates. The effect on the Boltzmann strategy is seen in the
probability of choosing the exploratory action below,

$$P(\pi(s_0) = a_1) = \frac{\exp\frac{Q_1}{\tau}}{\sum_a \exp\frac{Q_a}{\tau}} = \frac{\exp\frac{q_1 + e_1}{\tau}}{\sum_a \exp\frac{q_a + e_a}{\tau}} \qquad (68)$$

and from our previous case we saw that this could be simplified to,

$$\frac{1}{1 + \exp\frac{Q_0 - Q_1}{\tau}} \qquad (69)$$

and that,

$$Q_0 - Q_1 = (q_0 - q_1) + (e_0 - e_1) \qquad (70)$$

Since $e_0 < e_1$ and $q_0 > q_1$ for this case, we see that the error term is negative and serves to
reduce the perceived difference below $\delta q$, which has the effect of increasing the likelihood of
exploration since the difference between the two estimates is reduced for fixed temperature. Depending
on the size of the error, this could cause the exploration policy to take more exploratory
steps to improve its estimate, despite an accurate estimate of the better action, reducing
overall system performance. We see then that any error in action-value estimation that
serves to contract this difference will induce exploratory behavior, while error that serves to
increase it will induce more greedy behavior. Note that this case is often encountered in
practice, since it is common that the best action will be sampled more often, resulting in a
lower approximation error than actions sampled less often.
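A small numerical check of Case 2 is sketched below in Python. The function names and the specific values are illustrative assumptions; the code only evaluates the two-action expressions discussed above to show how unequal errors shift the exploration probability.

```python
import math


def explore_prob_with_errors(q0, q1, e0, e1, tau):
    """Boltzmann probability of choosing a_1 given true values and estimation errors."""
    perceived_gap = (q0 - q1) + (e0 - e1)   # difference between the two estimates
    return 1.0 / (1.0 + math.exp(perceived_gap / tau))


def greedy_choice_is_correct(q0, q1, e0, e1):
    """True when the greedy action under the estimates is still the truly better action a_0."""
    return (q0 - q1) > (e1 - e0)


no_error = explore_prob_with_errors(1.0, 0.0, 0.0, 0.0, tau=1.0)
contracted = explore_prob_with_errors(1.0, 0.0, 0.0, 0.5, tau=1.0)
print(no_error, contracted)
print(greedy_choice_is_correct(1.0, 0.0, 0.0, 0.5))
print(greedy_choice_is_correct(1.0, 0.0, 0.0, 1.5))
```

An error that narrows the gap between the estimates raises the Boltzmann exploration probability even though the greedy choice is still correct, whereas the epsilon-greedy policy is unaffected until the error is large enough to reverse the ordering of the estimates.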
APPENDIX C: BAYESIAN OPTIMAL POLICY FOR N-ARM
BANDIT


The total regret and regret per turn have previously been used to gauge the performance of
learning algorithms in the bandit problem. They compare the expected utility of choosing the
best alternative at each turn with the expected utility of the chosen path. As a means of
gaining further insight into the specific problem of when to explore, we expand the state tree
for the 2-arm bandit problem and solve for the sequence of choices that leads to the best
solution out to depth 10. The result is a set of nodes whose state is a tuple
$(n_a, k_a, n_b, k_b)$, where $n$ is the number of times that an arm, $a$ or $b$, has been
chosen, and $k$ is the number of times the arm has paid off. The first node in the tree is then
the tuple $(0, 0, 0, 0)$. Children are added to each parent node reflecting all states that can
be reached from that node, branching by a factor of four at each node,

$$(n_a + 1, k_a, n_b, k_b),\ (n_a + 1, k_a + 1, n_b, k_b),\ (n_a, k_a, n_b + 1, k_b),\ (n_a, k_a, n_b + 1, k_b + 1) \tag{71}$$

The probability of visiting a state from the current state is calculated using the
probability of success of each arm and a particular set of hyper-parameters, $(\alpha, \beta, w)$,
set on initialization. The hyper-parameters are used to determine the probability of visiting
any node, so the probability of transitioning from the root node to the child state
$n_a = 1, k_a = 0, n_b = 0, k_b = 0$ is the probability of choosing A initially, $w$, times
$(1 - \alpha)$, the probability of not receiving a payoff after choosing A. The expected utility
of each node is determined by either summing up the $k$ hits for the parent node or determining
the expected utility of the child nodes, with the maximum expected utility of the children
becoming the expected utility of the parent. The result is a tree that is completely expanded
out to the specified depth, with the expected utilities then rolled back up to the root node and
the optimal choices of lever identified at each depth. The resulting policy is harvested with an
additional parameter added to the tuple reflecting the optimal choice at each state,
$(n_a, k_a, n_b, k_b, c)$.

Using this state information we can determine the points on the optimal path at
which an exploratory action is taken from the perspective of an algorithm that does not
know the underlying distributions of the arms, but relies only on the observations contained
in the original form of the tuple in order to learn the hyper-parameters. An exploratory
action is defined as one in which the optimal choice, $c$, from a state transitions to
the arm with the lower probability of success, $\min(p_a, p_b)$, where,

$$p_a = \frac{k_a}{n_a} \quad \text{and} \quad p_b = \frac{k_b}{n_b} \tag{72}$$

$\forall n_a, n_b > 0$, and 0 otherwise. This approach provides the full set of nodes that belong
to the optimal policy, $o \in P$, and the set of nodes that belong to the exploratory set,
$e \in E$, that are also members of the set of nodes in the optimal policy for a given
$\lambda, \alpha, \beta$ out to the specified depth.
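
The node expansion and the test for an exploratory action described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the treatment of ties (equal empirical rates count as non-exploratory) is an assumption not specified in the text.

```python
from typing import List, Tuple

State = Tuple[int, int, int, int]  # (n_a, k_a, n_b, k_b)


def children(state: State) -> List[State]:
    """The four successor states: pull a or b, with or without a payoff."""
    n_a, k_a, n_b, k_b = state
    return [
        (n_a + 1, k_a, n_b, k_b),
        (n_a + 1, k_a + 1, n_b, k_b),
        (n_a, k_a, n_b + 1, k_b),
        (n_a, k_a, n_b + 1, k_b + 1),
    ]


def empirical_rates(state: State) -> Tuple[float, float]:
    """Observed success rates k/n, taken as 0 for an arm that was never pulled."""
    n_a, k_a, n_b, k_b = state
    p_a = k_a / n_a if n_a > 0 else 0.0
    p_b = k_b / n_b if n_b > 0 else 0.0
    return p_a, p_b


def is_exploratory(state: State, choice: str) -> bool:
    """True when the chosen arm has the lower observed success rate."""
    p_a, p_b = empirical_rates(state)
    if p_a == p_b:  # assumption: a tie is not counted as exploration
        return False
    worse = "a" if p_a < p_b else "b"
    return choice == worse


root: State = (0, 0, 0, 0)
print(children(root))
print(is_exploratory((3, 2, 2, 0), "b"))
```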

Using this empirical data it is possible to determine empirically a policy,
$f(n_a, k_a, n_b, k_b)$, that identifies when it is optimal to explore. For this policy to be
general and useful when the underlying distribution is not known, it must be based on empirical
data drawn from multiple combinations of $\lambda, \alpha, \beta$. The added benefit of this
approach, and the original motivation for the work, is that a tight bound on the expected
discounted utility is produced that can then be used for comparison to empirical results using
combinations of action-value functions and exploration strategies. Initial results were obtained
by applying this approach to four combinations of $\alpha$ and $\beta$,
(|^, ||), |), (||, ||), (|, |), across a range of $\lambda$, $\{0.1, 0.2, \ldots, 1.0\}$. This
resulted in a set $P$ consisting of 818 nodes and a set $E$ consisting of 128 nodes, with
$w = 0.5$. No differences in optimal policy membership were observed as a result of variations
in $\lambda$ for the same distribution. Differences in optimal policy membership were observed
between combinations of $\alpha$ and $\beta$, with the differences based on the magnitude of
$\alpha - \beta$. The optimal choice of arm for all states when priors are known is obtained by
updating the prior probability based on current observations.
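
$$P(\alpha, \beta \mid k_a, k_b)\, P(k_a, k_b) = P(k_a, k_b \mid \alpha, \beta)\, P(\alpha, \beta) \tag{73}$$

$$P(\alpha, \beta \mid k_a, k_b) \propto P(k_a, k_b \mid \alpha, \beta)\, P(\alpha, \beta) \tag{74}$$

and if $P(\alpha, \beta) = P(\beta, \alpha)$, then,

$$P(\alpha, \beta \mid k_a, k_b) \propto P(k_a, k_b \mid \alpha, \beta) \tag{75}$$

and,

$$P(\alpha, \beta \mid k_a, k_b) \propto \binom{n_a}{k_a} \alpha^{k_a} (1 - \alpha)^{n_a - k_a} \binom{n_b}{k_b} \beta^{k_b} (1 - \beta)^{n_b - k_b} \tag{76}$$

$$\propto \alpha^{k_a} (1 - \alpha)^{n_a - k_a} \beta^{k_b} (1 - \beta)^{n_b - k_b} \tag{77}$$
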


If, for arms with equal rewards, we assume that one arm is always preferable,
$\beta - \alpha > 0$, then the problem is to determine the perceived location of the best arm,
$\beta$. So when,

$$P(\alpha, \beta \mid k_a, k_b) < P(\beta, \alpha \mid k_a, k_b), \text{ choose A} \tag{78}$$

$$P(\alpha, \beta \mid k_a, k_b) > P(\beta, \alpha \mid k_a, k_b), \text{ choose B} \tag{79}$$

$$P(\alpha, \beta \mid k_a, k_b) = P(\beta, \alpha \mid k_a, k_b), \text{ choose A} \tag{80}$$
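
The decision rules above can be sketched as a comparison of two unnormalized posteriors. The Python below is an illustrative sketch under the assumptions stated in the text (a symmetric prior and $\beta > \alpha$); the function names and the example counts are hypothetical. The binomial coefficients are omitted because they are identical on both sides of the comparison.

```python
def likelihood(p_arm_a, p_arm_b, n_a, k_a, n_b, k_b):
    """Unnormalized likelihood of the counts if arm a pays p_arm_a and arm b pays p_arm_b."""
    return (p_arm_a ** k_a * (1 - p_arm_a) ** (n_a - k_a)
            * p_arm_b ** k_b * (1 - p_arm_b) ** (n_b - k_b))


def choose_arm(alpha, beta, n_a, k_a, n_b, k_b):
    """Pick the arm more likely to hold the better rate beta (assumes beta > alpha)."""
    p_ab = likelihood(alpha, beta, n_a, k_a, n_b, k_b)  # best arm located at B
    p_ba = likelihood(beta, alpha, n_a, k_a, n_b, k_b)  # best arm located at A
    if p_ab > p_ba:
        return "B"
    return "A"  # covers both the "less than" case and the tie


# Hypothetical counts: arm b paid off twice in two pulls, arm a never did.
print(choose_arm(0.3, 0.7, 2, 0, 2, 2))
```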


Using these rules we correctly select the choice made at each of the nodes in our
empirical set $P$. Using this information we can compare the decisions made by reinforcement
learning algorithms to the optimal policy. This result provides us with a means of
identifying the best arm out of the $k$ arms when the priors are known. When the priors
are unknown we need only know that there exists an arm, $k^*$, such that its probability of
success, $\alpha_{k^*}$, is greater than the probability of success of all other arms,
$\alpha_{k^*} > \alpha_k$. The probability of the best arm being located in each of the $k$
positions is calculated, and the choice is the location with the highest probability, with
arbitrary tie breaking. An examination of the member nodes of $E$ results in observations
useful to characterize the optimal time to explore, remembering that the decision to explore is
the result of an update that resulted in a false belief in the location of the optimal arm.
APPENDIX D: ANALYSIS OF DQ-C AND TD(λ)


Relating DQ-C to TD(λ)

TD(λ) serves as a foundational method bridging the gap between MC and TD techniques
through the use of e(s) to estimate V(s) (Sutton & Barto, 1998). MC methods use
the full sample return as the estimate,

$$R_t^{MC} = \sum_{i=0}^{L-1} \gamma^i r_{t+i}, \tag{81}$$

where $L$ is the number of state transitions in the episode after time $t$, while TD methods
estimate the return by using the previous estimate,

$$R_t^{TD} = r_t + \gamma V(s_{t+1}), \tag{82}$$

where $R_t^{TD}$ is the estimate from $s_t$ and $r_t$ is the reward received going from $s_t$
to $s_{t+1}$. These two estimates are examples of the more generalized n-step return, where for
$n \geq 1$,

$$R_t^{(n)} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n V(s_{t+n}), \tag{83}$$

which uses the observations of $n$ transitions and the previous estimate provided by the value
function. TD(λ) uses all n-step returns simultaneously by incorporating a new parameter,
$\lambda \in [0, 1]$, to provide the λ-return,

$$R_t^{\lambda} = (1 - \lambda) \sum_{n=0}^{\infty} \lambda^n R_t^{(n+1)}. \tag{84}$$

The λ-return algorithm uses the λ-return to determine the increment to $V(s_t)$ for each step,

$$V(s_t) \leftarrow V(s_t) + \alpha \left[ R_t^{\lambda} - V_t(s_t) \right], \tag{85}$$

with the update only occurring to $s = s_t$.

Pausing here we can see that $U_j(s, a)$ from DQ-C most closely compares to $R_t^{MC}$,
differing in the manner in which point rewards are accrued, with DQ-C crediting each (s, a)
pair with all future point rewards, indexing the sum on the number of point rewards, $k$,
while $R_t^{MC}$ credits only $s$ and indexes on the number of state transitions in the episode.
The two approaches also differ in the manner in which $\gamma$ is decayed, using the time
elapsed since the visit to a given (s, a) in DQ-C versus the transition index in TD(λ).

$$U_j(s, a) = \sum_{i=1}^{k} \gamma^{t_i - t_j} r_{t_i}, \qquad R_t^{MC} = \sum_{i=0}^{L-1} \gamma^i r_{t+i} \tag{86}$$

$U_j(s, a)$ clearly contrasts with $R_t^{TD}$, which bootstraps off its previous estimate. Since
this forms the basis of $R_t^{(n)}$, which lies at the heart of the λ-return, we see that DQ-C
is distinctly different from the "forward view" of TD(λ). The manner in which $Q(s, a)$ is
calculated in DQ-C differs from the λ-return as well, with the index of the summation in
DQ-C being tied to visits to the (s, a) rather than steps forward in time from a given state.

The "backward view" of TD(λ) is typically used in implementation and increments
e(s) for each visit to a state s; see Algorithm 14. By contrast, DQ-C maintains an estimate
for the value of each visit to a (s, a), directly assigning credit to each visit as point rewards
are obtained in order to update the expected utility obtained from each visit.
Algorithm 14 TD(λ)

1: Initialize π (i.e. ε-greedy, β).
2: Initialize V(s) arbitrarily, e(s) = 0 for all s.
3: Return a using π(s).
4: Take action a, observe reward r, and next state, s'.
5: δ ← r + γV(s') − V(s).
6: e(s) ← e(s) + 1.
7: for all s do
8:     V(s) ← V(s) + αδe(s)
9:     e(s) ← γλe(s)
10: end for
11: s ← s'
12: Increment t. If t ≤ T, go to line 3.
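
The backward view in Algorithm 14 can be sketched in Python as follows. This is an illustrative sketch rather than the code used in this work: the `td_lambda` signature, the handling of episode termination (treating the terminal value as zero and resetting the traces), and the two-step chain environment are assumptions introduced for the example.

```python
from collections import defaultdict


def td_lambda(step, policy, start, gamma=0.9, alpha=0.1, lam=0.8, max_steps=2000):
    """Tabular TD(lambda), backward view, with accumulating eligibility traces."""
    V = defaultdict(float)  # state-value estimates
    e = defaultdict(float)  # eligibility traces
    s = start
    for _ in range(max_steps):
        a = policy(s)
        r, s_next, done = step(s, a)
        # Assumption: the value of a terminal state is zero.
        target = r if done else r + gamma * V[s_next]
        delta = target - V[s]
        e[s] += 1.0
        for x in list(e):
            V[x] += alpha * delta * e[x]
            e[x] *= gamma * lam
        if done:
            e.clear()  # assumption: traces are reset between episodes
            s = start
        else:
            s = s_next
    return V


def chain_step(s, a):
    """Hypothetical deterministic chain 0 -> 1 -> 2; reaching state 2 pays 1 and ends."""
    s_next = s + 1
    done = s_next == 2
    return (1.0 if done else 0.0), s_next, done


values = td_lambda(chain_step, policy=lambda s: "right", start=0)
print(values[0], values[1])
```

On this toy chain the true values are 1 for state 1 and γ for state 0, which gives a simple check that the trace update is wired correctly.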


TD(λ) then makes use of a learning rate parameter to incrementally learn from the
one-step return standard to TD methods, while DQ-C recursively updates either a sample
or exponentially weighted average. TD(λ) requires three parameters: a discount factor, γ,
a learning rate, α, and a parameter to control the length of the backup, λ.

DQ-C requires only the use of a discount factor, γ, in the sample average case, and
adds a learning rate or base weight, α, that serves a similar purpose as λ in TD(λ) in the
exponentially weighted case. Several potential strategies are yet to be explored to eliminate
the need for α in the exponential version of DQ-C. The consideration of time in the
application of weights in the exponentially weighted version of the algorithm also serves
to distinguish DQ-C from other approaches based on TD(λ), such as Q(λ) and SARSA(λ).
The next section examines the convergence properties of DQ-C.

Analysis of DQ-C 

In this section we will analyze the convergence properties of DQ-C in a stationary 
episodic task environment. 
Definition 1. Let $\{a_n\}$ be a sequence of real numbers. We say that $\{a_n\}$ is Cauchy
convergent provided for every $\epsilon > 0$ there is an $N(\epsilon)$ such that
$n > N(\epsilon)$ and $m > N(\epsilon)$ implies $|a_n - a_m| < \epsilon$.

Lemma 1. Given an episodic task environment, a stochastic $\pi$, and an infinite number of
trials, $t \to \infty$, the number of visits, $n$, to each (s, a) will go to $\infty$, as will
the number of point rewards, $k$.

Assume DQ-C is paired with a stochastic $\pi$. This guarantees that as $t \to \infty$,
$n \to \infty$ for each (s, a), and since $R(s)$ assigns point rewards based on visits to $s$ we
can see that $k \to \infty$ as well. We also know that $\gamma^{t_i - t_j}$ forms a geometric
series since $t_i > t_j\ \forall i \in k$ for each $j$, and that as $k \to \infty$ the
geometric series $\gamma^{t_i - t_j}$ converges to 0.

Theorem 1. The estimate of the utility of each $j$th visit to a (s, a), $U_j(k)$, converges as
$t \to \infty$.

Assuming constant non-negative rewards, we see that
$|U_j(i-1) - U_j(i)| > |U_j(i) - U_j(i+1)|\ \forall i \in k$ and that as $k \to \infty$,
$|U_j(k) - U_j(k+1)| \to 0$, since with fixed $t_j$, $t_{k+1} - t_j \to \infty$ and
$\gamma^{\infty} \to 0$, and we can see that there exists an $N(\epsilon)$ such that
$|U_j(i+n) - U_j(i+m)| < \epsilon,\ \forall n, m > N(\epsilon)$.

Theorem 2. The estimate of the sample average of each (s, a), $Q(n, k)$, converges to the
true mean value of a visit to (s, a), $\mu$, as $t \to \infty$.

$Q(n, k)$ can be viewed as the sample mean of a convergent series of partial sums
by examining the discussion of $U_j(k)$, and we can leverage the convergence properties of
$U_j(k)$ to show that the right-hand side of the one-step update of $Q(n, k)$ goes to zero
as $n, k \to \infty$, since $\lim_{n,k \to \infty} \gamma^{t_{k+1} - t_j} = 0$, and since
$\lim_{n \to \infty} \frac{n}{n+1} = 1$ we see that
$\lim_{n,k \to \infty} \frac{n}{n+1} Q(n, k) = Q(n, k)$.

Theorem 3. The estimate of the exponentially weighted average of each (s, a), $Q(n, k)$,
converges to the recency weighted mean of a visit to (s, a), $\mu(n, k, t)$, as
$t, n, k \to \infty$.

The exponentially weighted average provides an estimate of the current value of a
(s, a) by ensuring that as $t \to \infty$ with fixed $t_j$, $(t - t_j) \to \infty$ and the
weight $\alpha^{t - t_j} \to 0$, for $0 < \alpha < 1$. Since the weights are normalized, as the
initial weights tend toward 0 greater emphasis is placed on new observations, so that
$|Q(n, k, t) - Q(n, k, t+1)| < \epsilon$ as $t \to \infty$ for fixed $n, k$, and
$\alpha^{t - t_j} < \alpha^{t - t_{n+1}}\ \forall j \in n$. We see that as the interval
$t - t_n \to \infty$ the estimate converges to the $(n+1)$th observation,

$$\lim_{t,n,k \to \infty} Q(n, k, t) = \frac{\sum_{j=1}^{n} \alpha^{t - t_j} U_j(k) + \alpha^{t - t_{n+1}} U_{n+1}(k)}{\sum_{j=1}^{n} \alpha^{t - t_j} + \alpha^{t - t_{n+1}}} = \frac{\alpha^{t - t_{n+1}} U_{n+1}(k)}{\alpha^{t - t_{n+1}}} = U_{n+1}(k) \tag{87}$$
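
The recency weighting can be illustrated numerically. The sketch below assumes the estimate is a normalized weighted sum with weights $\alpha^{t - t_j}$; that form, the function name, and the sample values are assumptions made for illustration only. It shows the newest observation dominating as earlier visits recede further into the past.

```python
def exp_weighted_estimate(visit_times, utilities, alpha, t):
    """Normalized average of visit utilities with weight alpha ** (t - t_j)."""
    weights = [alpha ** (t - t_j) for t_j in visit_times]
    total = sum(weights)
    return sum(w * u for w, u in zip(weights, utilities)) / total


# Hypothetical history: three old visits, then one new visit after a growing gap.
old_times = [0, 1, 2]
old_utils = [1.0, 2.0, 3.0]
new_util = 10.0

estimates = []
for gap in (1, 5, 20, 50):
    t_new = old_times[-1] + gap
    q = exp_weighted_estimate(old_times + [t_new], old_utils + [new_util], 0.5, t_new)
    estimates.append(q)
print(estimates)
```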
APPENDIX E: PTSP MAPS


Map sets for the Physical Traveling Salesman Problem.


Figure 45: Map 2 for physical traveling salesman problem.
Figure 46: Map 3 for physical traveling salesman problem.
Figure 47: Map 4 for physical traveling salesman problem.
Figure 48: Map 5 for physical traveling salesman problem.
Figure 49: Map 6 for physical traveling salesman problem.
Figure 50: Map 7 for physical traveling salesman problem.
Figure 51: Map 8 for physical traveling salesman problem.
Figure 52: Map 9 for physical traveling salesman problem.
Figure 53: Map 10 for physical traveling salesman problem.
APPENDIX F: FULL ASC-U FORMULATION


ASC-U formulates this problem as a finite state and finite time horizon dynamic
program with the following indices (Ahner et al., 2006),

• $t$, current time, $t = 0, 1, \ldots, T$, where $T$ is the maximum simulation time,

• $i$, $i^{th}$ UAV, $i \in UAV$, where $UAV$ is the set of UAV,

• $j$, $j^{th}$ GCS, $j \in GCS$, where $GCS$ is the set of GCS,

• $k$, $k^{th}$ LRS, $k \in LRS$, where $LRS$ is the set of LRS,

• $l$, $l^{th}$ Mission, $l \in Msn$, where $Msn$ is the set of Missions,

and the following additional terms,

• $\tau$, the time horizon over which the system is optimized,

• $(x, y)_t$, the Cartesian coordinate of an object at $t$,

• $t_{flight}^{UAV_i}$, time of flight of the $i^{th}$ UAV at $t$,

• $u_t$, a decision vector that acts on the system, selected from the finite set $U$ at each $t$,

• $s_t$, the vector describing the state of the system at $t$, where $s_t \in S$,

• $s_{t+\tau}$, the state of the system at the end of the next time horizon,

used in the dynamic programming formulation where the update to $s_{t+\tau}$ is further
defined,

$$s_{t+\tau} = f_t(s_t, u_t, \tau) \tag{88}$$

and $s_t$ is further defined,

$$s_t = \left( s_t^{UAV_i}, s_t^{GCS_j}, s_t^{LRS_k}, s_t^{Msn_l} \right) \tag{89}$$

where,

$$s_t^{UAV_i} = \left( (x, y)_t^{UAV_i}, t_{flight}^{UAV_i}, t_{recovery}^{UAV_i}, sensor^{UAV_i}, type^{UAV_i} \right)$$

$$s_t^{GCS_j} = \left( (x, y)_t^{GCS_j}, capacity_t^{GCS_j}, type^{GCS_j} \right)$$

$$s_t^{LRS_k} = \left( (x, y)_t^{LRS_k}, capacity_t^{LRS_k}, type^{LRS_k} \right)$$

$$s_t^{Msn_l} = \left( (x, y)^{Msn_l}, sensor^{Msn_l}, type^{Msn_l} \right)$$

and,

$$sensor_m = \left( sensorType_m, (t_{open}, t_{endOpen}), v_{rate} \right), \tag{90}$$

describes the sensor requirement of each mission area.

In the dynamic programming formulation the cost-to-go or future value function at
time $t$ is used to determine the optimal policy, here mapping $\pi(s) \to U$,

$$J_t(s_t) = \max_{u_t} C_t(s_t, u_t) + J_{t+\tau}(s_{t+\tau}), \tag{91}$$

for $t = 0, \ldots, T - \tau_f$, where $\tau_f$ is the time difference from the end of the time
horizon and the last applied control. Since for the deterministic case all mission demands are
known in advance, the optimal control policy could be obtained by,

$$u_t = \arg\max_{u_t} C_t(s_t, u_t) + J_{t+\tau}(s_{t+\tau}), \tag{92}$$
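
The cost-to-go recursion can be sketched as a generic finite-horizon backward induction. The Python below is a minimal illustration, not the ASC-U implementation: the function signature, the unit time step, and the toy line-world problem are all assumptions made for the example.

```python
def backward_induction(states, controls, transition, reward, horizon):
    """Finite-horizon dynamic program: J_t(s) = max_u C_t(s, u) + J_{t+1}(s')."""
    J = {s: 0.0 for s in states}  # terminal cost-to-go is assumed to be zero
    policy = []
    for t in reversed(range(horizon)):
        J_t, pi_t = {}, {}
        for s in states:
            best_u = max(controls,
                         key=lambda u: reward(t, s, u) + J[transition(t, s, u)])
            pi_t[s] = best_u
            J_t[s] = reward(t, s, best_u) + J[transition(t, s, best_u)]
        J = J_t
        policy.insert(0, pi_t)
    return J, policy


# Hypothetical toy problem: move along positions 0..3, earning 1 per step spent at 3.
STATES = [0, 1, 2, 3]
CONTROLS = (-1, 0, 1)


def move(t, s, u):
    return min(3, max(0, s + u))


def gain(t, s, u):
    return 1.0 if move(t, s, u) == 3 else 0.0


J0, plan = backward_induction(STATES, CONTROLS, move, gain, horizon=3)
print(J0, plan[0])
```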

The problem as addressed in ASC-U is formulated using the following indices
(Ahner et al., 2006),

• $t$, current time, $t = 0, 1, \ldots, T$, where $T$ is the maximum simulation time,

• $i$, $i^{th}$ UAV, $i \in UAV$, where $UAV$ is the set of UAV,

• $j$, $j^{th}$ GCS, $j \in GCS$, where $GCS$ is the set of GCS,

• $k$, $k^{th}$ LRS, $k \in LRS$, where $LRS$ is the set of LRS,

• $l$, $l^{th}$ Mission, $l \in Msn$, where $Msn$ is the set of Missions,

and the following sets,

• $A$, set of all mission areas with active missions during the optimization time horizon,

• $L$, set of all active LRS,

• $G$, set of all GCS,

• $G_l$, set of all GCS assigned to LRS $l$,

• $C_g$, number of UAV that GCS $g$ is capable of controlling,

• $I_l$, subset of UAV at LRS $l \in L$, defined by $\min\{$ready UAV, launch limit, airborne
UAV, $\sum_{g \in G_l} C_g -$ assigned UAV to GCS$\}$,

• $J_l$, subset of all sensor packages located at LRS $l \in L$,

• $Y_{ga} = 1$ if mission area $a$ is assigned to GCS $g$, 0 otherwise,

• $C_{ja}$, reward for a UAV with sensor package $j$ being assigned to mission area $a$ from
the soonest possible arrival time of the UAV at $a$ to the end of the time horizon,
$t + \delta t_i$, for UAV $i$,

• $X_{ja} = 1$ if a UAV with sensor package $j$ is assigned to $a$, and 0 otherwise.
The formulation for this revised problem seeks to maximize the value of the mission
areas covered subject to four constraints,

$$\max \sum_{j,a} C_{ja} X_{ja} \tag{93}$$

subject to,

$$\sum_{j} X_{ja} \leq 1 \quad \forall a \in A \tag{94}$$

which ensures that only a single UAV is assigned to a mission area,

$$\sum_{a} X_{ja} \leq 1 \quad \forall i \in I \tag{95}$$

which ensures that only a single mission area is assigned to each UAV,

$$\sum_{j,a} Y_{ga} X_{ja} \leq C_g \quad \forall g \in G \tag{96}$$

which constrains the number of UAVs a single GCS can control, and,

$$\sum_{j \in J_l} \sum_{a} X_{ja} \leq |I_l| \quad \forall l \in L \tag{97}$$

which ensures that the sensors assigned to an area do not exceed the capacity of the assigned
UAVs.
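
For very small instances, the core of this assignment problem can be checked by exhaustive search. The sketch below is only an illustration: it enforces just the two one-to-one constraints (at most one UAV per mission area and at most one mission area per UAV), ignores the GCS and LRS capacity constraints, and uses a hypothetical reward matrix. It is not the solution method used by ASC-U.

```python
from itertools import product


def best_assignment(reward):
    """Brute-force search; reward[j][a] is the value of sending UAV j to area a."""
    n_area = len(reward[0])
    options = [None] + list(range(n_area))  # None means the UAV stays unassigned
    best_value = 0.0
    best = tuple(None for _ in reward)
    for assign in product(options, repeat=len(reward)):
        used = [a for a in assign if a is not None]
        if len(used) != len(set(used)):
            continue  # two UAVs on one area: violates the one-UAV-per-area constraint
        value = sum(reward[j][a] for j, a in enumerate(assign) if a is not None)
        if value > best_value:
            best_value, best = value, assign
    return best_value, best


# Hypothetical rewards for two UAVs and two mission areas.
value, assignment = best_assignment([[5.0, 1.0], [4.0, 3.0]])
print(value, assignment)
```

Each UAV receives at most one area by construction, so the second one-to-one constraint never needs an explicit check in this sketch.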

The assignment of mission areas to GCS is determined by a heuristic, see Algorithm 15,
as is the determination of the reward, see Algorithm 16. This reformulation
proved successful in practice and was applied to several real-world studies
by both TRAC and MCCDC-OAD. We discuss the application of RL to this problem in the
following section.
Algorithm 15 Assignment of Mission Areas to GCS.

1: for each LRS and mission area a do
2:     N_a = the number of GCS g that are in range of UAV assigned to a.
3: end for
4: for each mission area a do
5:     sort by N_a.
6:     for each GCS g do
7:         sort by Σ_a Y_ga
8:         If a is in range of g, let Y_ga = 1
9:     end for

Algorithm 16 Assignment of value for completed missions.

1: for each UAV i do
2:     for each mission area a do
3:         t_0 = first time after the earliest arrival time that UAV i can gain value by being
           assigned to mission area a
4:         t_1 = min (the latest time UAV i can remain at mission area a, t + δt_i, for UAV i)
5:         K_a = the set of all missions located at mission area a
6:         V_{i,k,t_0,t_1} = the value that UAV i gains from mission k by being at mission area a
7:         Calculate, C_ia = Σ_{k ∈ K_a} V_{i,k,t_0,t_1}
8:     end for
9: end for
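
The reward calculation in Algorithm 16 can be sketched in Python. The value model is an assumption: the sketch supposes that a mission accrues value linearly at a fixed rate while the UAV's time on station overlaps the mission window, which the text does not state. The class, function, and dictionary names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Mission:
    t_open: float   # time the mission window opens
    t_end: float    # time the mission window closes
    v_rate: float   # value per unit time (assumed linear accrual)


def coverage_value(mission, t0, t1):
    """Value gained from one mission while on station during [t0, t1]."""
    overlap = min(t1, mission.t_end) - max(t0, mission.t_open)
    return mission.v_rate * max(0.0, overlap)


def reward_table(arrival, latest_stay, horizon_end, missions_by_area):
    """C[(i, a)]: total value UAV i gains by being assigned to mission area a."""
    C = {}
    for i in arrival:
        for a, missions in missions_by_area.items():
            t0 = arrival[i][a]
            t1 = min(latest_stay[i][a], horizon_end[i])
            C[(i, a)] = sum(coverage_value(m, t0, t1) for m in missions)
    return C


# Hypothetical data: one UAV, one mission area holding two missions.
table = reward_table(
    arrival={"uav1": {"area1": 3.0}},
    latest_stay={"uav1": {"area1": 8.0}},
    horizon_end={"uav1": 7.0},
    missions_by_area={"area1": [Mission(2.0, 6.0, 1.5), Mission(5.0, 10.0, 2.0)]},
)
print(table)
```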
APPENDIX G: LEARNED STATE-ACTION VALUES FOR URBANSIM.


Table 20: Learned policy for Battalion Commander by turn for 15 turn UrbanSim game
using DQ-C.

Strategy  1         2         3         4         5         6         7         8
SKC       6.6E-316  6.6E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SKH       2.3E+02   2.4E+02   2.4E+02   2.4E+02   2.4E+02   2.4E+02   2.4E+02   2.5E+02
SKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SNC       2.4E+02   2.4E+02   2.3E+02   2.3E+02   2.4E+02   2.4E+02   2.4E+02   2.4E+02
SNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SNB       2.3E+02   2.6E+02   2.3E+02   2.4E+02   2.3E+02   2.4E+02   2.4E+02   2.5E+02
MKC       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MKH       2.6E+02   2.3E+02   2.6E+02   2.4E+02   2.4E+02   2.4E+02   2.4E+02   2.3E+02
MKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MNC       2.4E+02   2.4E+02   2.4E+02   2.6E+02   2.3E+02   2.4E+02   2.4E+02   2.4E+02
MNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MNB       0.0E+00   2.4E+02   2.3E+02   2.3E+02   2.4E+02   2.4E+02   2.4E+02   2.4E+02

Strategy  9         10        11        12        13        14        15
SKC       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SKH       2.4E+02   2.5E+02   2.5E+02   2.6E+02   2.5E+02   2.6E+02   2.6E+02
SKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SNC       2.5E+02   2.5E+02   2.6E+02   2.6E+02   2.6E+02   2.6E+02   3.0E+02
SNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SNB       2.4E+02   2.5E+02   2.6E+02   2.9E+02   2.5E+02   2.6E+02   2.6E+02
MKC       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MKH       2.5E+02   2.4E+02   2.9E+02   2.6E+02   2.6E+02   2.6E+02   2.6E+02
MKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MNC       2.5E+02   2.8E+02   2.5E+02   2.6E+02   2.6E+02   2.6E+02   2.6E+02
MNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MNB       2.5E+02   2.5E+02   2.5E+02   2.5E+02   2.6E+02   0.0E+00   2.6E+02
Table 21: Learned policy for CA unit by turn for 15 turn UrbanSim game using DQ-C

Strategy  1         2         3         4         5         6         7         8
SKC       6.6E-316  6.6E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SKH       2.4E+02   2.3E+02   2.4E+02   2.3E+02   2.3E+02   2.4E+02   2.4E+02   2.5E+02
SKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SNC       2.6E+02   2.4E+02   2.3E+02   2.4E+02   2.3E+02   2.4E+02   2.7E+02   2.5E+02
SNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SNB       2.4E+02   2.4E+02   2.3E+02   2.3E+02   2.4E+02   2.4E+02   2.4E+02   2.4E+02
MKC       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MKH       2.3E+02   2.6E+02   2.6E+02   2.4E+02   2.3E+02   2.4E+02   2.3E+02   2.4E+02
MKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MNC       2.3E+02   2.4E+02   0.0E+00   2.4E+02   2.4E+02   2.3E+02   2.4E+02   2.4E+02
MNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MNB       2.4E+02   2.4E+02   2.4E+02   2.3E+02   2.4E+02   2.4E+02   0.0E+00   2.4E+02

Strategy  9         10        11        12        13        14        15
SKC       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SKH       2.5E+02   2.8E+02   2.6E+02   2.6E+02   0.0E+00   2.7E+02   2.6E+02
SKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SNC       2.4E+02   2.5E+02   2.5E+02   2.6E+02   2.6E+02   2.6E+02   2.7E+02
SNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SNB       2.4E+02   2.5E+02   2.6E+02   2.6E+02   2.6E+02   2.6E+02   2.6E+02
MKC       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MKH       2.4E+02   2.5E+02   2.6E+02   2.6E+02   2.5E+02   3.0E+02   2.6E+02
MKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MNC       2.8E+02   2.5E+02   2.5E+02   2.6E+02   2.6E+02   2.6E+02   2.6E+02
MNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MNB       2.5E+02   2.5E+02   2.5E+02   2.5E+02   0.0E+00   2.6E+02   2.6E+02
Table 22: Learned policy for E CO a by turn for 15 turn UrbanSim game using DQ-C

Strategy  1         2         3         4         5         6         7         8
SKC       6.3E-316  4.8E-316  1.0E-315  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316
SKH       2.6E+02   2.4E+02   2.6E+02   2.3E+02   2.4E+02   2.4E+02   2.3E+02   2.5E+02
SKB       7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316
SNC       2.4E+02   2.4E+02   2.3E+02   2.3E+02   2.3E+02   2.4E+02   2.4E+02   2.4E+02
SNH       7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316
SNB       2.4E+02   2.3E+02   2.4E+02   2.4E+02   0.0E+00   2.4E+02   2.4E+02   2.4E+02
MKC       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
MKH       2.3E+02   2.4E+02   2.3E+02   2.4E+02   2.3E+02   2.4E+02   2.4E+02   2.4E+02
MKB       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
MNC       2.4E+02   2.3E+02   2.3E+02   2.4E+02   2.3E+02   2.3E+02   2.7E+02   2.3E+02
MNH       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
MNB       2.4E+02   2.4E+02   2.4E+02   2.3E+02   2.6E+02   2.4E+02   2.4E+02   2.4E+02

Strategy  9         10        11        12        13        14        15
SKC       7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316
SKH       2.5E+02   0.0E+00   2.5E+02   2.6E+02   2.5E+02   2.7E+02   3.0E+02
SKB       7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316
SNC       2.4E+02   2.5E+02   2.5E+02   2.5E+02   2.6E+02   2.6E+02   0.0E+00
SNH       7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316
SNB       2.4E+02   2.5E+02   2.6E+02   2.6E+02   2.5E+02   2.6E+02   2.6E+02
MKC       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
MKH       2.5E+02   2.5E+02   2.6E+02   2.9E+02   2.6E+02   2.6E+02   2.5E+02
MKB       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
MNC       2.5E+02   2.5E+02   2.9E+02   2.6E+02   2.6E+02   2.6E+02   2.6E+02
MNH       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
MNB       2.8E+02   2.5E+02   2.6E+02   2.5E+02   2.6E+02   2.6E+02   2.7E+02
Table 23: Learned policy for F CO a by turn for 15 turn UrbanSim game using DQ-C

Strategy  1         2         3         4         5         6         7         8
SKC       6.3E-316  4.8E-316  1.0E-315  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316
SKH       0.0E+00   2.3E+02   2.3E+02   2.4E+02   2.4E+02   2.4E+02   2.4E+02   2.4E+02
SKB       7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316
SNC       2.4E+02   0.0E+00   2.4E+02   2.4E+02   2.4E+02   2.4E+02   2.7E+02   2.4E+02
SNH       7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316
SNB       2.6E+02   2.4E+02   2.4E+02   2.3E+02   2.3E+02   2.4E+02   2.4E+02   2.7E+02
MKC       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
MKH       2.4E+02   2.4E+02   2.3E+02   2.4E+02   2.3E+02   2.4E+02   2.4E+02   2.4E+02
MKB       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
MNC       2.3E+02   2.3E+02   2.4E+02   2.4E+02   2.4E+02   2.4E+02   2.4E+02   2.5E+02
MNH       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
MNB       2.4E+02   2.4E+02   2.3E+02   2.3E+02   2.3E+02   2.4E+02   2.4E+02   2.4E+02

Strategy  9         10        11        12        13        14        15
SKC       7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316
SKH       2.4E+02   2.5E+02   2.6E+02   2.6E+02   2.6E+02   3.0E+02   2.6E+02
SKB       7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316
SNC       2.4E+02   2.5E+02   2.5E+02   2.9E+02   2.6E+02   2.6E+02   2.7E+02
SNH       7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316
SNB       2.5E+02   2.5E+02   2.5E+02   2.5E+02   2.6E+02   2.7E+02   2.6E+02
MKC       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
MKH       2.5E+02   2.5E+02   2.6E+02   2.6E+02   2.6E+02   2.6E+02   2.6E+02
MKB       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
MNC       2.5E+02   2.5E+02   2.9E+02   2.6E+02   2.9E+02   0.0E+00   3.0E+02
MNH       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
MNB       2.4E+02   2.5E+02   2.5E+02   0.0E+00   2.5E+02   2.6E+02   2.6E+02
Table 24: Learned policy for E CO b by turn for 15 turn UrbanSim game using DQ-C

Strategy  1         2         3         4         5         6         7         8
SKC       6.3E-316  4.8E-316  1.0E-315  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316
SKH       2.4E+02   2.4E+02   2.4E+02   2.4E+02   2.3E+02   2.4E+02   2.4E+02   2.4E+02
SKB       7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316
SNC       2.3E+02   2.3E+02   2.6E+02   2.3E+02   2.3E+02   2.4E+02   2.4E+02   2.5E+02
SNH       7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316
SNB       2.6E+02   2.3E+02   2.3E+02   2.4E+02   2.4E+02   2.6E+02   2.4E+02   2.5E+02
MKC       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
MKH       2.3E+02   2.4E+02   2.3E+02   2.3E+02   2.4E+02   2.4E+02   2.4E+02   2.3E+02
MKB       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
MNC       2.4E+02   2.4E+02   2.3E+02   2.4E+02   2.4E+02   2.4E+02   2.4E+02   2.5E+02
MNH       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
MNB       2.4E+02   2.4E+02   2.3E+02   2.4E+02   2.4E+02   2.4E+02   2.4E+02   2.4E+02

Strategy  9         10        11        12        13        14        15
SKC       7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316
SKH       2.5E+02   2.5E+02   2.5E+02   2.9E+02   2.6E+02   0.0E+00   2.6E+02
SKB       7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316
SNC       2.8E+02   2.8E+02   2.5E+02   2.6E+02   2.6E+02   2.7E+02   2.6E+02
SNH       7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316
SNB       2.5E+02   2.5E+02   2.9E+02   2.6E+02   2.6E+02   2.6E+02   2.6E+02
MKC       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
MKH       2.4E+02   2.5E+02   2.6E+02   2.6E+02   0.0E+00   2.6E+02   0.0E+00
MKB       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
MNC       2.4E+02   2.5E+02   2.6E+02   2.5E+02   2.6E+02   2.6E+02   2.6E+02
MNH       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
MNB       2.4E+02   2.5E+02   2.6E+02   2.6E+02   2.6E+02   2.6E+02   3.0E+02
Table 25: Learned policy for F CO b by turn for 15 turn UrbanSim game using DQ-C

Strategy  1         2         3         4         5         6         7         8
SKC       6.3E-316  4.8E-316  1.0E-315  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316
SKH       2.4E+02   2.6E+02   2.3E+02   2.4E+02   2.4E+02   2.4E+02   2.4E+02   2.4E+02
SKB       7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316
SNC       2.3E+02   2.4E+02   0.0E+00   0.0E+00   2.6E+02   2.4E+02   2.4E+02   2.7E+02
SNH       7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316
SNB       2.4E+02   2.3E+02   2.3E+02   2.3E+02   2.3E+02   2.3E+02   2.4E+02   2.4E+02
MKC       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
MKH       2.4E+02   2.4E+02   2.4E+02   2.3E+02   2.3E+02   2.4E+02   2.4E+02   2.3E+02
MKB       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
MNC       2.4E+02   2.4E+02   2.3E+02   2.4E+02   2.4E+02   2.4E+02   2.4E+02   0.0E+00
MNH       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
MNB       2.3E+02   2.4E+02   2.3E+02   2.3E+02   2.4E+02   2.4E+02   2.4E+02   2.5E+02

Strategy  9         10        11        12        13        14        15
SKC       7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316
SKH       2.4E+02   2.5E+02   2.5E+02   2.6E+02   2.6E+02   2.6E+02   2.6E+02
SKB       7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316
SNC       2.8E+02   2.5E+02   2.6E+02   2.9E+02   2.6E+02   0.0E+00   2.5E+02
SNH       7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316
SNB       2.5E+02   2.5E+02   2.6E+02   2.6E+02   2.6E+02   2.6E+02   2.6E+02
MKC       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
MKH       2.5E+02   2.5E+02   2.6E+02   2.6E+02   2.6E+02   2.6E+02   2.6E+02
MKB       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
MNC       2.5E+02   2.8E+02   2.5E+02   2.6E+02   2.6E+02   2.7E+02   2.6E+02
MNH       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
MNB       0.0E+00   2.5E+02   2.5E+02   2.5E+02   2.6E+02   2.7E+02   2.6E+02
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Table 26: Learned policy for G CO a by turn for 15 turn UrbanSim game using DQ-C 


Strategy  1         2         3         4         5         6         7         8
SKC       6.5E-316  4.8E-316  1.0E-315  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316
SKH       2.4E+02   2.4E+02   2.3E+02   2.4E+02   2.4E+02   2.4E+02   2.3E+02   2.4E+02
SKB       7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316
SNC       2.3E+02   0.0E+00   2.4E+02   2.3E+02   2.3E+02   2.4E+02   2.4E+02   2.3E+02
SNH       7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316
SNB       2.6E+02   0.0E+00   2.3E+02   2.3E+02   2.3E+02   2.4E+02   2.4E+02   0.0E+00
MKC       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
MKH       2.4E+02   2.3E+02   2.4E+02   2.4E+02   2.4E+02   2.4E+02   2.4E+02   2.4E+02
MKB       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
MNC       2.4E+02   2.4E+02   2.6E+02   2.3E+02   2.4E+02   2.4E+02   2.4E+02   2.5E+02
MNH       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
MNB       2.3E+02   2.4E+02   0.0E+00   2.4E+02   2.3E+02   2.4E+02   2.4E+02   2.4E+02

Strategy  9         10        11        12        13        14        15
SKC       7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316
SKH       2.5E+02   2.5E+02   2.5E+02   2.6E+02   2.5E+02   2.6E+02   2.6E+02
SKB       7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316
SNC       2.4E+02   2.5E+02   0.0E+00   2.6E+02   2.6E+02   2.6E+02   2.6E+02
SNH       7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316
SNB       2.5E+02   2.5E+02   2.5E+02   2.6E+02   2.6E+02   2.7E+02   3.0E+02
MKC       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
MKH       2.5E+02   0.0E+00   2.6E+02   2.6E+02   2.6E+02   2.6E+02   2.5E+02
MKB       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
MNC       2.5E+02   2.5E+02   2.5E+02   2.5E+02   2.6E+02   3.0E+02   0.0E+00
MNH       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
MNB       2.4E+02   2.5E+02   2.5E+02   2.6E+02   2.6E+02   2.7E+02   2.6E+02

Table 27: Learned policy for H CO a by turn for 15 turn UrbanSim game using DQ-C


Strategy  1         2         3         4         5         6         7         8
SKC       6.3E-316  4.8E-316  2.3E+02   2.3E+02   2.4E+02   2.4E+02   2.7E+02   2.4E+02
SKH       2.4E+02   2.4E+02   0.0E+00   2.4E+02   2.6E+02   2.4E+02   2.7E+02   2.3E+02
SKB       2.4E+02   2.4E+02   0.0E+00   2.4E+02   2.3E+02   2.4E+02   2.4E+02   2.5E+02
SNC       2.4E+02   2.4E+02   2.3E+02   2.3E+02   2.4E+02   2.4E+02   2.4E+02   2.4E+02
SNH       2.4E+02   2.4E+02   2.4E+02   2.3E+02   2.6E+02   2.4E+02   2.4E+02   2.5E+02
SNB       2.4E+02   2.3E+02   2.3E+02   2.6E+02   2.4E+02   2.4E+02   0.0E+00   2.7E+02
MKC       2.4E+02   2.3E+02   2.4E+02   2.4E+02   2.3E+02   2.6E+02   2.4E+02   2.4E+02
MKH       2.3E+02   2.4E+02   2.4E+02   2.4E+02   2.4E+02   2.4E+02   2.4E+02   2.4E+02
MKB       2.3E+02   2.4E+02   2.3E+02   2.4E+02   0.0E+00   2.3E+02   2.4E+02   2.4E+02
MNC       2.4E+02   2.4E+02   2.3E+02   0.0E+00   2.3E+02   2.4E+02   2.4E+02   2.4E+02
MNH       2.4E+02   2.4E+02   2.3E+02   2.6E+02   0.0E+00   2.4E+02   2.4E+02   2.7E+02
MNB       2.4E+02   2.3E+02   2.3E+02   0.0E+00   2.4E+02   2.4E+02   2.4E+02   2.4E+02

Strategy  9         10        11        12        13        14        15
SKC       2.5E+02   0.0E+00   2.5E+02   2.9E+02   2.9E+02   2.6E+02   2.7E+02
SKH       2.5E+02   2.5E+02   2.6E+02   2.6E+02   2.6E+02   3.0E+02   2.6E+02
SKB       2.5E+02   2.5E+02   2.6E+02   2.5E+02   2.6E+02   2.6E+02   2.6E+02
SNC       2.4E+02   2.5E+02   2.5E+02   2.5E+02   0.0E+00   2.7E+02   3.0E+02
SNH       2.5E+02   2.5E+02   2.5E+02   2.6E+02   2.6E+02   3.0E+02   3.0E+02
SNB       2.4E+02   2.5E+02   2.6E+02   2.9E+02   2.6E+02   2.6E+02   2.6E+02
MKC       2.4E+02   2.8E+02   2.6E+02   2.5E+02   2.6E+02   2.6E+02   2.6E+02
MKH       2.5E+02   2.8E+02   2.5E+02   2.6E+02   2.6E+02   2.6E+02   2.6E+02
MKB       2.5E+02   2.5E+02   2.5E+02   2.6E+02   2.6E+02   2.7E+02   2.5E+02
MNC       2.5E+02   2.5E+02   2.6E+02   2.6E+02   2.5E+02   2.7E+02   2.6E+02
MNH       2.4E+02   2.5E+02   2.5E+02   2.6E+02   2.6E+02   2.6E+02   2.6E+02
MNB       2.5E+02   2.5E+02   2.5E+02   2.5E+02   2.5E+02   2.6E+02   2.5E+02

Table 28: Learned policy for G CO b by turn for 15 turn UrbanSim game using DQ-C


Strategy  1         2         3         4         5         6         7         8
SKC       6.3E-316  4.8E-316  1.0E-315  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316
SKH       2.4E+02   2.4E+02   2.4E+02   2.4E+02   2.3E+02   2.4E+02   2.4E+02   2.4E+02
SKB       7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316
SNC       2.3E+02   2.4E+02   2.3E+02   2.4E+02   2.3E+02   2.4E+02   2.4E+02   2.4E+02
SNH       7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316
SNB       2.6E+02   2.4E+02   2.6E+02   2.4E+02   2.4E+02   2.4E+02   2.4E+02   2.4E+02
MKC       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
MKH       2.4E+02   2.4E+02   2.4E+02   2.4E+02   2.4E+02   0.0E+00   2.4E+02   2.5E+02
MKB       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
MNC       2.4E+02   2.6E+02   2.3E+02   2.4E+02   2.4E+02   2.4E+02   2.4E+02   2.3E+02
MNH       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
MNB       2.3E+02   2.4E+02   2.3E+02   2.3E+02   2.3E+02   2.4E+02   2.4E+02   0.0E+00

Strategy  9         10        11        12        13        14        15
SKC       7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316
SKH       2.8E+02   2.5E+02   2.5E+02   2.6E+02   0.0E+00   2.6E+02   2.7E+02
SKB       7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316
SNC       2.5E+02   2.4E+02   2.5E+02   2.6E+02   2.5E+02   0.0E+00   2.6E+02
SNH       7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316  7.1E-316
SNB       2.4E+02   2.5E+02   2.9E+02   2.6E+02   2.6E+02   2.6E+02   2.6E+02
MKC       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
MKH       2.5E+02   2.5E+02   2.5E+02   2.5E+02   2.6E+02   2.6E+02   2.6E+02
MKB       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
MNC       2.5E+02   2.5E+02   2.5E+02   2.6E+02   2.5E+02   0.0E+00   2.6E+02
MNH       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
MNB       2.5E+02   2.5E+02   2.6E+02   2.6E+02   2.6E+02   2.6E+02   2.6E+02

Table 29: Learned policy for QRF by turn for 15 turn UrbanSim game using DQ-C


Strategy  1         2         3         4         5         6         7         8
SKC       6.3E-316  4.8E-316  2.3E+02   2.3E+02   2.4E+02   2.4E+02   2.4E+02   2.5E+02
SKH       2.4E+02   2.4E+02   2.4E+02   2.4E+02   2.3E+02   2.4E+02   2.4E+02   2.4E+02
SKB       2.4E+02   2.3E+02   2.4E+02   0.0E+00   2.4E+02   2.4E+02   2.4E+02   2.4E+02
SNC       2.3E+02   2.4E+02   2.3E+02   2.4E+02   2.3E+02   2.4E+02   2.4E+02   2.4E+02
SNH       2.6E+02   2.4E+02   2.4E+02   2.3E+02   2.4E+02   2.3E+02   2.4E+02   2.4E+02
SNB       2.6E+02   2.4E+02   2.6E+02   2.4E+02   2.4E+02   2.4E+02   2.4E+02   2.4E+02
MKC       2.3E+02   2.4E+02   2.3E+02   2.6E+02   2.3E+02   2.4E+02   2.4E+02   2.4E+02
MKH       2.4E+02   2.4E+02   2.4E+02   2.4E+02   2.4E+02   0.0E+00   2.4E+02   2.5E+02
MKB       2.4E+02   2.4E+02   2.3E+02   2.4E+02   2.4E+02   2.4E+02   2.4E+02   2.4E+02
MNC       2.4E+02   2.6E+02   2.3E+02   2.4E+02   2.4E+02   2.4E+02   2.4E+02   2.3E+02
MNH       2.4E+02   2.4E+02   2.6E+02   2.3E+02   2.3E+02   2.4E+02   2.4E+02   2.4E+02
MNB       2.3E+02   2.4E+02   2.3E+02   2.3E+02   2.3E+02   2.4E+02   2.4E+02   0.0E+00

Strategy  9         10        11        12        13        14        15
SKC       2.8E+02   2.5E+02   2.5E+02   2.6E+02   2.6E+02   0.0E+00   2.6E+02
SKH       2.8E+02   2.5E+02   2.5E+02   2.6E+02   0.0E+00   2.6E+02   2.7E+02
SKB       2.5E+02   0.0E+00   2.6E+02   2.5E+02   2.6E+02   2.6E+02   3.0E+02
SNC       2.5E+02   2.4E+02   2.5E+02   2.6E+02   2.5E+02   0.0E+00   2.6E+02
SNH       2.5E+02   2.8E+02   2.5E+02   2.6E+02   2.6E+02   2.6E+02   2.6E+02
SNB       2.4E+02   2.5E+02   2.9E+02   2.6E+02   2.6E+02   2.6E+02   2.6E+02
MKC       2.5E+02   2.5E+02   2.5E+02   2.6E+02   2.6E+02   2.6E+02   2.6E+02
MKH       2.5E+02   2.5E+02   2.5E+02   2.5E+02   2.6E+02   2.6E+02   2.6E+02
MKB       2.4E+02   2.5E+02   2.5E+02   2.6E+02   2.5E+02   2.6E+02   2.6E+02
MNC       2.5E+02   2.5E+02   2.5E+02   2.6E+02   2.5E+02   0.0E+00   2.6E+02
MNH       2.5E+02   2.5E+02   2.5E+02   2.9E+02   2.6E+02   3.0E+02   2.7E+02
MNB       2.5E+02   2.5E+02   2.6E+02   2.6E+02   2.6E+02   2.6E+02   2.6E+02

Table 30: Learned policy for H CO b by turn for 15 turn UrbanSim game using DQ-C.


Strategy  1         2         3         4         5         6         7         8
SKC       6.3E-316  4.8E-316  2.3E+02   2.3E+02   2.4E+02   2.4E+02   2.7E+02   2.4E+02
SKH       2.4E+02   0.0E+00   2.3E+02   2.4E+02   2.6E+02   2.4E+02   2.4E+02   2.5E+02
SKB       2.4E+02   2.4E+02   0.0E+00   2.4E+02   2.3E+02   2.4E+02   2.4E+02   2.5E+02
SNC       2.4E+02   2.4E+02   2.4E+02   2.3E+02   2.3E+02   2.4E+02   2.4E+02   2.4E+02
SNH       2.4E+02   2.4E+02   2.4E+02   2.3E+02   2.6E+02   2.4E+02   2.4E+02   2.5E+02
SNB       2.4E+02   2.3E+02   2.6E+02   2.4E+02   2.3E+02   2.4E+02   2.4E+02   2.4E+02
MKC       2.4E+02   2.3E+02   2.4E+02   2.4E+02   2.3E+02   2.6E+02   2.4E+02   2.4E+02
MKH       2.3E+02   2.4E+02   2.4E+02   2.4E+02   2.4E+02   2.3E+02   2.4E+02   2.4E+02
MKB       2.3E+02   2.4E+02   2.3E+02   2.4E+02   0.0E+00   2.3E+02   2.4E+02   2.4E+02
MNC       2.4E+02   2.4E+02   2.3E+02   2.3E+02   2.4E+02   2.4E+02   2.4E+02   2.7E+02
MNH       2.4E+02   2.4E+02   2.3E+02   2.6E+02   0.0E+00   2.4E+02   2.4E+02   2.7E+02
MNB       2.3E+02   2.6E+02   2.3E+02   2.4E+02   2.4E+02   2.4E+02   2.4E+02   2.4E+02

Strategy  9         10        11        12        13        14        15
SKC       2.5E+02   0.0E+00   2.5E+02   2.9E+02   2.9E+02   2.6E+02   2.7E+02
SKH       0.0E+00   2.5E+02   2.5E+02   2.5E+02   2.9E+02   3.0E+02   2.6E+02
SKB       2.5E+02   2.5E+02   2.6E+02   2.5E+02   2.6E+02   2.6E+02   2.6E+02
SNC       2.4E+02   2.5E+02   2.5E+02   2.6E+02   2.6E+02   2.7E+02   2.6E+02
SNH       2.5E+02   2.5E+02   2.5E+02   2.6E+02   2.6E+02   3.0E+02   3.0E+02
SNB       2.8E+02   2.5E+02   2.9E+02   2.5E+02   2.6E+02   2.6E+02   2.6E+02
MKC       2.4E+02   2.8E+02   2.6E+02   2.5E+02   2.6E+02   2.6E+02   2.6E+02
MKH       2.5E+02   2.5E+02   2.5E+02   2.5E+02   2.6E+02   2.6E+02   2.6E+02
MKB       2.5E+02   2.5E+02   2.5E+02   2.6E+02   2.6E+02   2.7E+02   2.5E+02
MNC       2.5E+02   2.5E+02   2.5E+02   2.6E+02   2.6E+02   2.7E+02   2.6E+02
MNH       2.4E+02   2.5E+02   2.5E+02   2.6E+02   2.6E+02   2.6E+02   2.6E+02
MNB       2.5E+02   2.5E+02   2.5E+02   2.9E+02   2.6E+02   2.6E+02   2.6E+02

Table 31: Learned policy for Battalion Commander by turn for 15 turn UrbanSim game using Q(A).


Strategy  1         2         3         4         5         6         7         8
SKC       6.3E-316  2.1E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Strategy  9         10        11        12        13        14        15
SKC       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Table 32: Learned policy for CA unit by turn for 15 turn UrbanSim game using Q(A)


Strategy  1         2         3         4         5         6         7         8
SKC       7.2E-316  7.7E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Strategy  9         10        11        12        13        14        15
SKC       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Table 33: Learned policy for E CO b by turn for 15 turn UrbanSim game using Q(A)


Strategy  1         2         3         4         5         6         7         8
SKC       7.2E-316  7.7E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Strategy  9         10        11        12        13        14        15
SKC       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Table 34: Learned policy for E CO a by turn for 15 turn UrbanSim game using Q(A)


Strategy  1         2         3         4         5         6         7         8
SKC       7.2E-316  7.7E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Strategy  9         10        11        12        13        14        15
SKC       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Table 35: Learned policy for F CO a by turn for 15 turn UrbanSim game using Q(A)


Strategy  1         2         3         4         5         6         7         8
SKC       6.3E-316  7.7E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Strategy  9         10        11        12        13        14        15
SKC       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Table 36: Learned policy for F CO b by turn for 15 turn UrbanSim game using Q(A)


Strategy  1         2         3         4         5         6         7         8
SKC       6.3E-316  7.7E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Strategy  9         10        11        12        13        14        15
SKC       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Table 37: Learned policy for G CO b by turn for 15 turn UrbanSim game using Q(A)


Strategy  1         2         3         4         5         6         7         8
SKC       6.3E-316  7.7E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Strategy  9         10        11        12        13        14        15
SKC       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Table 38: Learned policy for G CO a by turn for 15 turn UrbanSim game using Q(A)


Strategy  1         2         3         4         5         6         7         8
SKC       6.3E-316  7.7E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Strategy  9         10        11        12        13        14        15
SKC       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Table 39: Learned policy for H CO a by turn for 15 turn UrbanSim game using Q(A)


Strategy 

1 

2 

3 

4 

5 

6 

7 

8 

skc 

6.3E-316 

7.7E-316 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

SKH 

9.5E+01 

9.2E+01 

0.0E+00 

7.3E+01 

4.8E+01 

2.8E+01 

9.3E+01 

3.0E+02 

SKB 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

SNC 

3.2E+01 

7.1E+01 

4.8E+01 

8.9E+01 

7.3E+01 

7.3E+01 

5.3E+01 

7.0E+01 

SNH 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

SNB 

6.0E+01 

5.0E+01 

2.8E+02 

2.8E+02 

I.Oe+02 

7.4E+01 

O.Oe+OO 

5.1E+01 

MKC 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

MKH 

9.3E+01 

5-Oe+OI 

I.Oe+02 

5.1E+01 

2.7E+01 

2.8E+01 

2.6E+01 

7.3E+01 

MKB 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

MNC 

9.3E+01 

7.2E+01 

4.9E+01 

O.Oe+OO 

6.9E+01 

9.5E+01 

9.2E+01 

9.9E+01 

MNH 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

MNB 

8.6E+01 

7.1E+01 

4.6E+01 

O.Oe+OO 

7.4E+01 

2.9E+02 

7.7E+01 

9.5E+01 


9 

10 

11 

12 

13 

14 

15 


SKC 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 


SKH 

2.7E+01 

2.9E+01 

5.7E+01 

5.7E+01 

3.4E+02 

3.4E+01 

l.lE+02 


SKB 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 


SNC 

7.7E+01 

I.Oe+02 

7.8E+01 

2.7E+01 

O.Oe+OO 

3.1E+01 

3.5E+02 


SNH 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 


SNB 

7.7E+01 

8.1E+01 

1.2E+02 

3.3E+02 

I.Oe+02 

l.lE+02 

8.9E+01 


MKC 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 


MKH 

7.6E+01 

3-Oe+OI 

2.9E+01 

8.6E+01 

6.0e+01 

6.1E+01 

1.2E+02 


MKB 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 


MNC 

9.2E+01 

7.9E+01 

5.5E+01 

5.7E+01 

5.6E+01 

8.0e+01 

1.3E+02 


MNH 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 

6.4E-316 


MNB 

3.1E+02 

3.0E+01 

1.2E+02 

1.2E+02 

8.4E+01 

8.1E+01 

5.2E+01 
Table 40: Learned policy for QRF by turn for 15-turn UrbanSim game using Q(λ)

Strategy  1         2         3         4         5         6         7         8
SKC       7.2E-316  6.3E-316  0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKH       1.5E-94   3.3E-86   2.2E-57   5.8E+252  3.3E-86   2.2E-57   5.8E+252  3.3E-86
SKB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNC       5.8E+252  3.3E-86   2.2E-57   5.8E+252  1.2E-76   7.6E-96   7.3E+199  8.0E-96
SNH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNB       1.1E-95   1.3E-71   1.2E+224  1.2E-76   8.1E-96   7.3E+199  3.3E-86   2.2E-57
MKC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKH       8.2E-67   1.2E+224  1.2E-76   8.5E-96   7.3E+199  3.3E-86   2.2E-57   5.8E+252
MKB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNC       1.2E+224  1.2E-76   9.0E-96   7.3E+199  3.3E-86   2.2E-57   5.8E+252  1.3E-76
MNH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNB       1.2E-76   9.5E-96   7.3E+199  3.3E-86   2.2E-57   5.8E+252  1.3E-76   9.4E-96

Strategy  9         10        11        12        13        14        15
SKC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKH       2.2E-57   5.8E+252  3.3E-86   2.2E-57   5.8E+252  3.3E-86   2.2E-57
SKB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNC       1.3E-71   1.2E+224  1.2E-76   8.1E-96   7.3E+199  3.3E-86   2.2E-57
SNH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNB       5.8E+252  3.3E-86   2.2E-57   5.8E+252  3.3E-86   2.2E-57   5.8E+252
MKC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKH       1.3E-76   9.4E-96   1.5E-94   1.3E-259  4.0E+15   8.1E-72   1.3E-259
MKB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNC       9.4E-96   1.5E-94   1.3E-259  4.0E+15   5.3E-67   5.2E-58   2.5E+06
MNH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNB       1.5E-94   1.3E-259  4.0E+15   3.5E-62   5.2E-58   2.5E+06   9.0E-96

Table 41: Learned policy for H CO b by turn for 15-turn UrbanSim game using Q(λ)

Strategy  1         2         3         4         5         6         7         8
SKC       6.3E-316  7.7E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Strategy  9         10        11        12        13        14        15
SKC       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316  6.4E-316
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Table 42: Learned policy for Battalion Commander by turn for 15-turn UrbanSim game using SARSA(λ).

Strategy  1         2         3         4         5         6         7         8
SKC       6.3E-316  5.1E-316  4.8E-316  4.9E-316  4.9E-316  6.2E-316  4.9E-316  2.0E-316
SKH       -5.3E+10  -5.2E+10  -5.6E+10  -3.3E+10  -3.2E+10  -6.9E+10  -2.7E+10  -7.2E+10
SKB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNC       -5.3E+10  -5.1E+10  -5.6E+10  -3.2E+10  -3.2E+10  -6.8E+10  -2.7E+10  -7.2E+10
SNH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNB       -5.2E+10  -5.2E+10  -5.6E+10  -3.3E+10  -3.2E+10  -6.8E+10  8.8E+14   3.3E+14
MKC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKH       -5.3E+10  -5.1E+10  -5.5E+10  -3.3E+10  -3.1E+10  -6.9E+10  -2.7E+10  -7.2E+10
MKB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNC       6.8E+14   -5.2E+10  -5.6E+10  -3.3E+10  -3.2E+10  -6.9E+10  -2.7E+10  -7.1E+10
MNH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNB       -5.3E+10  7.3E+14   -5.6E+10  -3.3E+10  -3.2E+10  -6.8E+10  -2.7E+10  -7.1E+10

Strategy  9         10        11        12        13        14        15
SKC       6.2E-316  6.2E-316  6.2E-316  6.2E-316  6.2E-316  6.2E-316  6.2E-316
SKH       -2.7E+10  -6.0E+10  -4.1E+10  -7.5E+10  -5.0E+10  -5.2E+10  -4.9E+10
SKB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNC       -2.7E+10  -6.0E+10  -4.1E+10  -7.5E+10  9.4E+14   7.0E+14   7.3E+14
SNH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNB       -2.7E+10  -5.9E+10  -4.1E+10  -7.5E+10  -5.0E+10  -5.2E+10  -4.9E+10
MKC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKH       -2.7E+10  -6.0E+10  -4.1E+10  -7.6E+10  -5.0E+10  -5.1E+10  -4.9E+10
MKB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNC       -2.7E+10  -6.0E+10  7.8E+14   5.5E+14   -5.1E+10  -5.2E+10  -4.9E+10
MNH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNB       -2.7E+10  -6.0E+10  -4.1E+10  -7.5E+10  -5.0E+10  -5.2E+10  -4.9E+10

Table 43: Learned policy for CA unit by turn for 15-turn UrbanSim game using SARSA(λ)

Strategy  1         2         3         4         5         6         7         8
SKC       6.3E-316  5.1E-316  4.8E-316  4.9E-316  4.9E-316  6.2E-316  4.9E-316  2.0E-316
SKH       -5.4E+12  -5.2E+12  -5.8E+12  -3.0E+12  -2.9E+12  -8.1E+12  -2.4E+12  -8.7E+12
SKB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNC       -5.4E+12  -5.2E+12  5.5E+14   -3.0E+12  -2.9E+12  -8.0E+12  -2.4E+12  3.3E+14
SNH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNB       -5.5E+12  -5.2E+12  -5.8E+12  -3.0E+12  -2.9E+12  3.5E+14   -2.3E+12  -8.6E+12
MKC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKH       -5.4E+12  -5.2E+12  -5.9E+12  -2.9E+12  -2.9E+12  -8.1E+12  -2.3E+12  -8.6E+12
MKB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNC       -5.5E+12  5.7E+14   -5.9E+12  -3.0E+12  5.2E+14   -8.2E+12  -2.4E+12  -8.6E+12
MNH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNB       5.5E+14   -5.3E+12  -5.9E+12  -3.0E+12  -2.9E+12  -8.1E+12  -2.3E+12  -8.6E+12

Strategy  9         10        11        12        13        14        15
SKC       6.2E-316  6.2E-316  6.2E-316  6.2E-316  6.2E-316  6.2E-316  6.2E-316
SKH       -2.4E+12  3.5E+14   -4.0E+12  -8.8E+12  -5.1E+12  -5.3E+12  -5.0E+12
SKB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNC       -2.3E+12  -6.9E+12  -4.0E+12  -8.7E+12  -5.0E+12  -5.2E+12  -5.0E+12
SNH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNB       -2.3E+12  -6.9E+12  -4.0E+12  -8.8E+12  -5.0E+12  -5.3E+12  -5.0E+12
MKC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKH       -2.3E+12  -6.9E+12  -3.9E+12  -8.8E+12  -5.0E+12  -5.3E+12  -5.0E+12
MKB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNC       -2.3E+12  -6.9E+12  -3.9E+12  -8.8E+12  -5.0E+12  -5.3E+12  -4.9E+12
MNH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNB       -2.4E+12  -6.8E+12  -4.0E+12  -8.7E+12  -5.1E+12  -5.3E+12  -4.9E+12

Table 44: Learned policy for E CO a by turn for 15-turn UrbanSim game using SARSA(λ)

Strategy  1         2         3         4         5         6         7         8
SKC       6.3E-316  5.1E-316  4.8E-316  4.9E-316  4.9E-316  6.2E-316  4.9E-316  2.0E-316
SKH       -2.2E+14  -2.1E+14  -2.7E+14  -2.7E+14  -2.2E+14  -1.7E+14  -2.6E+14  -1.6E+14
SKB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNC       -2.8E+14  -2.3E+14  -2.4E+14  -2.3E+14  -1.8E+14  -1.9E+14  -2.2E+14  -1.5E+14
SNH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNB       -2.1E+14  -2.8E+14  -2.1E+14  -3.1E+14  -1.9E+14  -1.6E+14  -2.5E+14  -1.7E+14
MKC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKH       -2.3E+14  -2.6E+14  -2.7E+14  -2.9E+14  -2.1E+14  -2.1E+14  -3.2E+14  -2.0E+14
MKB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNC       -2.7E+14  -2.9E+14  -2.8E+14  -2.8E+14  -2.5E+14  -1.8E+14  -2.6E+14  -1.9E+14
MNH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNB       -2.6E+14  -2.5E+14  -2.3E+14  -2.3E+14  -1.8E+14  -2.0E+14  -3.1E+14  -1.4E+14

Strategy  9         10        11        12        13        14        15
SKC       6.2E-316  6.2E-316  6.2E-316  6.2E-316  6.2E-316  6.2E-316  6.2E-316
SKH       -2.4E+14  -1.6E+14  -2.6E+14  -2.2E+14  -3.2E+14  -2.5E+14  -2.3E+14
SKB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNC       -2.9E+14  -1.9E+14  -2.1E+14  -2.0E+14  -3.1E+14  -2.3E+14  -3.2E+14
SNH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNB       -2.8E+14  -2.0E+14  -2.7E+14  -2.0E+14  -2.6E+14  -2.8E+14  -3.1E+14
MKC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKH       -2.5E+14  -1.9E+14  -2.1E+14  -2.3E+14  -2.5E+14  -2.2E+14  -2.4E+14
MKB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNC       -3.1E+14  -1.7E+14  -2.2E+14  -2.4E+14  -2.4E+14  -2.6E+14  -2.5E+14
MNH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNB       -2.7E+14  -1.8E+14  -2.5E+14  -1.7E+14  -2.7E+14  -2.4E+14  -2.6E+14

Table 45: Learned policy for E CO b by turn for 15-turn UrbanSim game using SARSA(λ)

Strategy  1         2         3         4         5         6         7         8
SKC       6.3E-316  5.1E-316  4.8E-316  4.9E-316  4.9E-316  6.2E-316  4.9E-316  2.0E-316
SKH       -2.5E+11  -2.4E+11  -2.6E+11  -1.5E+11  -1.4E+11  -3.3E+11  -1.2E+11  -3.5E+11
SKB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNC       -2.5E+11  -2.4E+11  9.9E+14   -1.5E+11  -1.4E+11  -3.3E+11  -1.2E+11  -3.5E+11
SNH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNB       -2.4E+11  -2.4E+11  -2.6E+11  1.4E+12   -1.4E+11  -3.4E+11  -1.2E+11  -3.5E+11
MKC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKH       -2.5E+11  -2.4E+11  -2.6E+11  1.1E+15   -1.4E+11  -3.3E+11  -1.2E+11  -3.5E+11
MKB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNC       -2.5E+11  1.0E+15   -2.6E+11  -1.5E+11  -1.4E+11  -3.3E+11  -1.2E+11  -3.5E+11
MNH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNB       -2.5E+11  -2.4E+11  -2.6E+11  -1.5E+11  -1.4E+11  -3.3E+11  -1.2E+11  -3.5E+11

Strategy  9         10        11        12        13        14        15
SKC       6.2E-316  6.2E-316  6.2E-316  6.2E-316  6.2E-316  6.2E-316  6.2E-316
SKH       -1.2E+11  6.2E+14   -1.9E+11  -3.7E+11  1.1E+15   -2.4E+11  -2.3E+11
SKB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNC       -1.2E+11  -2.9E+11  -1.9E+11  -3.6E+11  -2.3E+11  -2.4E+11  -2.3E+11
SNH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNB       -1.2E+11  -2.9E+11  -1.9E+11  -3.7E+11  -2.3E+11  -2.4E+11  -2.3E+11
MKC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKH       -1.2E+11  -2.9E+11  9.7E+14   -3.7E+11  -2.3E+11  -2.4E+11  -2.3E+11
MKB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNC       1.1E+15   -2.9E+11  -1.9E+11  7.9E+14   -2.3E+11  -2.4E+11  -2.3E+11
MNH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNB       -1.2E+11  -2.9E+11  -1.9E+11  -3.6E+11  -2.3E+11  -2.4E+11  -2.3E+11

Table 46: Learned policy for F CO b by turn for 15-turn UrbanSim game using SARSA(λ)

Strategy  1         2         3         4         5         6         7         8
SKC       6.3E-316  5.1E-316  4.8E-316  4.9E-316  4.9E-316  6.2E-316  4.9E-316  2.0E-316
SKH       -2.6E+13  -2.4E+13  -2.7E+13  -1.3E+13  -1.3E+13  -4.0E+13  2.8E+14   -4.3E+13
SKB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNC       -2.6E+13  -2.4E+13  -2.8E+13  -1.3E+13  -1.3E+13  -4.0E+13  -1.0E+13  -4.4E+13
SNH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNB       -2.5E+13  -2.4E+13  -2.8E+13  -1.3E+13  -1.3E+13  -4.0E+13  -1.0E+13  -4.4E+13
MKC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKH       -2.6E+13  -2.4E+13  -2.8E+13  -1.3E+13  -1.3E+13  -4.1E+13  -1.0E+13  -4.3E+13
MKB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNC       1.8E+14   -2.4E+13  -2.8E+13  -1.3E+13  -1.3E+13  5.3E+13   -1.0E+13  -4.4E+13
MNH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNB       -2.6E+13  -2.4E+13  -2.8E+13  2.8E+14   -1.3E+13  -4.0E+13  -1.0E+13  -4.4E+13

Strategy  9         10        11        12        13        14        15
SKC       6.2E-316  6.2E-316  6.2E-316  6.2E-316  6.2E-316  6.2E-316  6.2E-316
SKH       -1.0E+13  -3.4E+13  -1.8E+13  -4.3E+13  -2.4E+13  -2.5E+13  -2.3E+13
SKB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNC       -1.0E+13  -3.4E+13  -1.8E+13  -4.3E+13  -2.4E+13  -2.5E+13  -2.3E+13
SNH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNB       2.9E+14   -3.4E+13  -1.8E+13  -4.3E+13  -2.3E+13  -2.5E+13  2.1E+14
MKC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKH       -1.0E+13  -3.4E+13  -1.8E+13  6.2E+13   -2.3E+13  -2.5E+13  -2.3E+13
MKB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNC       -1.0E+13  -3.4E+13  -1.8E+13  -4.3E+13  -2.4E+13  2.0E+14   -2.3E+13
MNH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNB       -1.0E+13  6.7E+13   -1.8E+13  -4.3E+13  -2.4E+13  -2.5E+13  -2.3E+13

Table 47: Learned policy for F CO a by turn for 15-turn UrbanSim game using SARSA(λ)

Strategy  1         2         3         4         5         6         7         8
SKC       6.3E-316  5.1E-316  4.8E-316  4.9E-316  4.9E-316  6.2E-316  4.9E-316  2.0E-316
SKH       -1.2E+14  -7.4E+13  -1.1E+14  -1.2E+14  -8.5E+13  -6.2E+13  -1.4E+14  -4.5E+13
SKB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNC       -1.1E+14  -1.1E+14  -9.5E+13  -1.5E+14  -7.8E+13  -5.9E+13  -1.1E+14  -5.0E+13
SNH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNB       -8.2E+13  -1.1E+14  -9.9E+13  -1.4E+14  -7.1E+13  -4.3E+13  -8.7E+13  -3.1E+13
MKC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKH       -9.0E+13  -1.2E+14  -7.1E+13  -1.2E+14  -1.1E+14  -3.3E+13  -1.2E+14  -6.1E+13
MKB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNC       -1.1E+14  -9.4E+13  -1.2E+14  -1.3E+14  -1.0E+14  -5.6E+13  -1.0E+14  -3.9E+13
MNH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNB       -9.5E+13  -8.6E+13  -8.6E+13  -1.0E+14  -8.1E+13  -6.5E+13  -1.5E+14  -5.3E+13

Strategy  9         10        11        12        13        14        15
SKC       6.2E-316  6.2E-316  6.2E-316  6.2E-316  6.2E-316  6.2E-316  6.2E-316
SKH       -1.2E+14  -6.3E+13  -1.1E+14  -3.7E+13  -7.9E+13  -1.2E+14  -1.2E+14
SKB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNC       -9.0E+13  -3.8E+13  -7.0E+13  -6.0E+13  -8.8E+13  -1.1E+14  -8.2E+13
SNH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNB       -1.1E+14  -4.4E+13  -1.0E+14  -5.1E+13  -1.3E+14  -9.1E+13  -7.8E+13
MKC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKH       -1.2E+14  -4.6E+13  -8.4E+13  -4.9E+13  -1.1E+14  -7.5E+13  -7.5E+13
MKB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNC       -1.5E+14  -5.7E+13  -9.2E+13  -7.3E+13  -8.3E+13  -7.2E+13  -8.6E+13
MNH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNB       -1.1E+14  -4.9E+13  -6.7E+13  -6.3E+13  -1.2E+14  -9.6E+13  -1.1E+14

Table 48: Learned policy for G CO a by turn for 15-turn UrbanSim game using SARSA(λ)

Strategy  1         2         3         4         5         6         7         8
SKC       6.3E-316  5.1E-316  4.8E-316  4.9E-316  4.9E-316  6.2E-316  4.9E-316  2.0E-316
SKH       -1.7E+14  -1.6E+14  -1.8E+14  -1.7E+14  -1.4E+14  -1.0E+14  -1.9E+14  -1.4E+14
SKB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNC       -2.1E+14  -2.1E+14  -1.7E+14  -2.4E+14  -1.8E+14  -1.5E+14  -2.2E+14  -1.4E+14
SNH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNB       -1.9E+14  -2.0E+14  -1.6E+14  -2.6E+14  -2.0E+14  -1.2E+14  -2.1E+14  -1.1E+14
MKC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKH       -1.5E+14  -1.7E+14  -2.3E+14  -2.1E+14  -1.5E+14  -1.1E+14  -1.7E+14  -1.5E+14
MKB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNC       -1.5E+14  -2.4E+14  -2.0E+14  -2.5E+14  -1.5E+14  -1.3E+14  -2.3E+14  -1.2E+14
MNH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNB       -1.8E+14  -2.2E+14  -2.2E+14  -2.2E+14  -1.4E+14  -1.4E+14  -2.5E+14  -1.3E+14

Strategy  9         10        11        12        13        14        15
SKC       6.2E-316  6.2E-316  6.2E-316  6.2E-316  6.2E-316  6.2E-316  6.2E-316
SKH       -2.0E+14  -1.5E+14  -2.1E+14  -1.6E+14  -1.9E+14  -1.8E+14  -1.8E+14
SKB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNC       -2.6E+14  -1.4E+14  -1.7E+14  -1.2E+14  -2.5E+14  -1.6E+14  -2.3E+14
SNH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNB       -1.7E+14  -1.0E+14  -1.6E+14  -1.7E+14  -2.4E+14  -2.1E+14  -1.9E+14
MKC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKH       -2.2E+14  -1.4E+14  -2.0E+14  -1.6E+14  -1.8E+14  -2.0E+14  -1.7E+14
MKB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNC       -2.1E+14  -9.8E+13  -1.7E+14  -1.3E+14  -1.7E+14  -1.7E+14  -2.4E+14
MNH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNB       -2.5E+14  -1.3E+14  -1.4E+14  -1.2E+14  -2.6E+14  -1.9E+14  -2.0E+14

Table 49: Learned policy for G CO b by turn for 15-turn UrbanSim game using SARSA(λ)

Strategy  1         2         3         4         5         6         7         8
SKC       6.3E-316  5.1E-316  4.8E-316  4.9E-316  4.9E-316  6.2E-316  4.9E-316  2.0E-316
SKH       -1.2E+12  -1.1E+12  -1.2E+12  -6.6E+11  -6.5E+11  -1.6E+12  -5.4E+11  -1.7E+12
SKB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNC       -1.2E+12  -1.1E+12  -1.2E+12  -6.7E+11  -6.5E+11  -1.6E+12  -5.3E+11  -1.7E+12
SNH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNB       -1.2E+12  -1.1E+12  -1.2E+12  -6.5E+11  -6.5E+11  6.1E+14   -5.2E+11  -1.7E+12
MKC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKH       8.7E+14   -1.1E+12  -1.2E+12  -6.7E+11  -6.5E+11  -1.7E+12  -5.3E+11  5.8E+14
MKB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNC       -1.2E+12  9.0E+14   -1.2E+12  -6.6E+11  -6.5E+11  -1.6E+12  -5.4E+11  -1.8E+12
MNH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNB       -1.2E+12  -1.1E+12  -1.2E+12  -6.6E+11  -6.4E+11  -1.7E+12  -5.3E+11  -1.7E+12

Strategy  9         10        11        12        13        14        15
SKC       6.2E-316  6.2E-316  6.2E-316  6.2E-316  6.2E-316  6.2E-316  6.2E-316
SKH       -5.4E+11  -1.4E+12  -8.7E+11  -1.8E+12  -1.1E+12  -1.1E+12  -1.1E+12
SKB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNC       -5.3E+11  -1.4E+12  -8.7E+11  -1.8E+12  -1.1E+12  -1.1E+12  -1.1E+12
SNH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNB       -5.3E+11  -1.4E+12  -8.6E+11  -1.8E+12  -1.1E+12  -1.1E+12  -1.1E+12
MKC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKH       9.6E+14   -1.4E+12  -8.6E+11  -1.8E+12  -1.1E+12  -1.1E+12  -1.1E+12
MKB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNC       -5.3E+11  -1.4E+12  -8.7E+11  -1.8E+12  9.7E+14   -1.1E+12  -1.1E+12
MNH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNB       -5.4E+11  5.8E+14   -8.7E+11  -1.8E+12  -1.1E+12  -1.1E+12  -1.1E+12

Table 50: Learned policy for H CO a by turn for 15-turn UrbanSim game using SARSA(λ)

Strategy  1         2         3         4         5         6         7         8
SKC       6.3E-316  5.1E-316  4.8E-316  4.9E-316  4.9E-316  6.2E-316  4.9E-316  2.0E-316
SKH       -1.1E+14  -1.2E+14  -1.1E+14  -1.3E+14  -6.0E+13  -5.8E+13  -1.8E+14  -4.4E+13
SKB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNC       -1.1E+14  -1.2E+14  -1.1E+14  -1.3E+14  -6.1E+13  -5.9E+13  -1.7E+14  -4.4E+13
SNH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNB       -1.1E+14  -1.2E+14  -1.1E+14  -1.3E+14  -6.0E+13  -5.9E+13  -1.8E+14  -4.4E+13
MKC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKH       -1.1E+14  -1.2E+14  -1.1E+14  -1.3E+14  -2.2E+13  -5.8E+13  -1.6E+14  -4.4E+13
MKB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNC       -1.1E+14  -1.2E+14  -1.1E+14  -1.3E+14  -5.9E+13  -5.9E+13  -1.8E+14  -4.4E+13
MNH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNB       -1.1E+14  -1.2E+14  -1.1E+14  -1.3E+14  -6.1E+13  -5.9E+13  -1.7E+14  -4.4E+13

Strategy  9         10        11        12        13        14        15
SKC       6.2E-316  6.2E-316  6.2E-316  6.2E-316  6.2E-316  6.2E-316  6.2E-316
SKH       -1.9E+14  -4.4E+13  -1.4E+14  -8.2E+13  -1.8E+14  -1.1E+14  -1.2E+14
SKB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNC       -1.7E+14  -4.4E+13  -1.6E+14  -8.1E+13  -1.9E+14  -1.0E+14  -1.2E+14
SNH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNB       -1.8E+14  -4.4E+13  -1.6E+14  -8.2E+13  -1.8E+14  -1.1E+14  -1.1E+14
MKC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKH       -1.8E+14  -4.4E+13  -1.6E+14  -7.1E+13  -2.0E+14  -1.1E+14  -1.2E+14
MKB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNC       -1.7E+14  -4.4E+13  -1.5E+14  -8.2E+13  -1.8E+14  -1.1E+14  -1.2E+14
MNH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNB       -1.7E+14  -4.4E+13  -1.6E+14  -8.2E+13  -2.0E+14  -1.1E+14  -1.2E+14

Table 51: Learned policy for QRF by turn for 15-turn UrbanSim game using SARSA(λ)

Strategy  1         2         3         4         5         6         7         8
SKC       6.3E-316  5.1E-316  -2.6E+14  -2.4E+14  -2.0E+14  -1.7E+14  -3.0E+14  -1.3E+14
SKH       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
SKB       -2.4E+14  -2.5E+14  -2.6E+14  -2.4E+14  -1.6E+14  -1.6E+14  -2.9E+14  -1.4E+14
SNC       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
SNH       -2.3E+14  -2.4E+14  -2.4E+14  -2.8E+14  -2.0E+14  -1.6E+14  -2.4E+14  -1.5E+14
SNB       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
MKC       -2.1E+14  -2.8E+14  -2.7E+14  -2.3E+14  -1.7E+14  -1.6E+14  -2.4E+14  -1.4E+14
MKH       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
MKB       -2.4E+14  -2.7E+14  -2.5E+14  -2.6E+14  -1.7E+14  -1.8E+14  -2.5E+14  -1.5E+14
MNC       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
MNH       -2.1E+14  -2.3E+14  -2.7E+14  -2.7E+14  -1.8E+14  -1.7E+14  -2.7E+14  -1.4E+14
MNB       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316

Strategy  9         10        11        12        13        14        15
SKC       -2.4E+14  -1.4E+14  -2.8E+14  -2.0E+14  -3.2E+14  -2.5E+14  -2.5E+14
SKH       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
SKB       -2.5E+14  -1.5E+14  -2.1E+14  -2.2E+14  -2.6E+14  -2.8E+14  -2.7E+14
SNC       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
SNH       -2.7E+14  -1.6E+14  -2.6E+14  -1.8E+14  -2.5E+14  -2.7E+14  -2.2E+14
SNB       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
MKC       -3.0E+14  -1.4E+14  -2.1E+14  -2.1E+14  -3.2E+14  -2.3E+14  -2.8E+14
MKH       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
MKB       -3.0E+14  -1.3E+14  -2.5E+14  -1.9E+14  -2.9E+14  -2.2E+14  -2.2E+14
MNC       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316
MNH       -2.9E+14  -1.4E+14  -2.3E+14  -1.7E+14  -2.8E+14  -2.4E+14  -2.4E+14
MNB       6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316  6.3E-316

Table 52: Learned policy for H CO b by turn for 15-turn UrbanSim game using SARSA(λ)

Strategy  1         2         3         4         5         6         7         8
SKC       6.3E-316  5.1E-316  4.8E-316  4.9E-316  4.9E-316  6.2E-316  4.9E-316  2.0E-316
SKH       -1.1E+10  -1.9E+13  -7.1E+09  -1.8E+13  -1.6E+13  -6.1E+09  -2.5E+13  -6.1E+09
SKB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNC       -1.1E+10  -1.2E+10  -7.3E+09  -1.9E+13  -1.3E+13  -6.1E+09  -4.0E+13  -6.1E+09
SNH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNB       -1.1E+10  -1.2E+10  -7.2E+09  -3.9E+13  -1.3E+13  -6.1E+09  -2.7E+13  3.3E+14
MKC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKH       -1.1E+10  -1.2E+10  -7.2E+09  -1.7E+13  -1.5E+13  3.0E+14   -3.4E+13  -6.1E+09
MKB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNC       -1.1E+10  -1.2E+10  -7.2E+09  -1.6E+13  -1.2E+13  -6.1E+09  -2.0E+13  -6.2E+09
MNH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNB       3.0E+12   -1.2E+10  -7.2E+09  -1.5E+13  -1.9E+13  -6.1E+09  -3.0E+13  -6.1E+09

Strategy  9         10        11        12        13        14        15
SKC       6.2E-316  6.2E-316  6.2E-316  6.2E-316  6.2E-316  6.2E-316  6.2E-316
SKH       -2.3E+13  -9.0E+09  -1.8E+13  -1.1E+10  -1.2E+13  -1.1E+10  -1.1E+10
SKB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNC       -3.7E+13  -8.9E+09  -1.6E+10  -1.1E+10  -2.0E+13  -1.1E+10  -1.1E+10
SNH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNB       -2.9E+13  -8.8E+09  -1.5E+13  -1.1E+10  -1.1E+10  -1.1E+10  -1.1E+10
MKC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKH       -3.9E+13  -8.9E+09  -1.7E+13  -1.1E+10  -1.1E+10  -1.0E+10  -1.1E+10
MKB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNC       -4.1E+13  -9.0E+09  -1.6E+10  3.1E+14   -1.1E+10  -1.1E+10  -2.0E+13
MNH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNB       -3.1E+13  -8.9E+09  -1.6E+10  -1.1E+10  -1.1E+10  -1.1E+10  -2.1E+13

Table 53: Learned policy for Battalion Commander by turn for 15-turn UrbanSim game using DQ-C, Boltzmann.

Strategy  1         2         3         4         5         6         7         8
SKC       9.1E-316  9.5E-316  1.4E-94   1.3E-76   7.6E-96   4.8E+228  5.0E+223  2.2E-57
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       1.3E-76   7.6E-96   4.8E+228  5.0E+223  8.0E-72   9.9E-96   4.0E+252  2.3E+15
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       3.4E-53   3.2E+160  1.8E-152  2.6E+180  1.2E-76   1.1E-95   1.2E+214  1.3E+219
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       2.6E+180  7.6E-96   1.1E-42   2.9E+161  1.4E+219  8.1E-72   1.3E-76   7.6E-96
MKH       2.5E+02   0.0E+00   0.0E+00   0.0E+00   2.6E+02   0.0E+00   0.0E+00   0.0E+00
MKB       9.3E+252  9.1E+227  1.4E-94   1.3E-76   7.6E-96   4.8E+228  5.0E+223  9.6E-48
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       6.0E+175  9.7E-72   4.1E-38   9.3E+252  9.1E+227  1.4E-94   1.0E-42   6.8E-38
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   2.7E+02

Strategy  9         10        11        12        13        14        15
SKC       4.1E-38   9.3E+252  9.1E+227  1.4E-94   1.3E-76   7.6E-96   4.8E+228
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       1.1E+243  5.2E-67   3.2E-57   4.1E-38   9.3E+252  9.1E+227  1.4E-94
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   3.1E+02
SNH       9.7E-72   3.4E-53   3.2E+160  1.8E-152  2.6E+180  1.2E-76   1.1E-95
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       4.8E+228  5.0E+223  1.2E-76   4.1E-38   9.3E+252  9.1E+227  1.4E-94
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       4.1E-38   9.3E+252  9.1E+227  1.4E-94   1.3E-76   7.6E-96   4.8E+228
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       1.2E-95   1.2E+214  1.3E+219  4.2E-62   3.4E-53   3.2E+160  1.8E-152
MNB       0.0E+00   0.0E+00   2.9E+02   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Table 54: Learned policy for CA unit by turn for 15 turn UrbanSim game using DQ-C,
Boltzmann.

Strategy  1         2         3         4         5         6         7         8
SKC       9.1E-316  9.5E-316  1.4E-94   1.3E-76   7.6E-96   4.8E+228  5.0E+223  2.2E-57
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       1.3E-76   7.6E-96   4.8E+228  5.0E+223  8.0E-72   9.9E-96   4.0E+252  2.3E+15
SNC       0.0E+00   2.5E+02   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       3.4E-53   3.2E+160  1.8E-152  2.6E+180  1.2E-76   1.1E-95   1.2E+214  1.3E+219
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   2.7E+02   0.0E+00
MKC       2.6E+180  7.6E-96   1.1E-42   2.9E+161  1.4E+219  8.1E-72   1.3E-76   7.6E-96
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   2.6E+02   0.0E+00   0.0E+00
MKB       9.3E+252  9.1E+227  1.4E-94   1.3E-76   7.6E-96   4.8E+228  5.0E+223  9.6E-48
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       6.0E+175  9.7E-72   4.1E-38   9.3E+252  9.1E+227  1.4E-94   1.0E-42   6.8E-38
MNB       0.0E+00   0.0E+00   2.6E+02   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Strategy  9         10        11        12        13        14        15
SKC       4.1E-38   9.3E+252  9.1E+227  1.4E-94   1.3E-76   7.6E-96   4.8E+228
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       1.1E+243  5.2E-67   3.2E-57   4.1E-38   9.3E+252  9.1E+227  1.4E-94
SNC       0.0E+00   0.0E+00   0.0E+00   3.0E+02   0.0E+00   0.0E+00   0.0E+00
SNH       9.7E-72   3.4E-53   3.2E+160  1.8E-152  2.6E+180  1.2E-76   1.1E-95
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       4.8E+228  5.0E+223  1.2E-76   4.1E-38   9.3E+252  9.1E+227  1.4E-94
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       4.1E-38   9.3E+252  9.1E+227  1.4E-94   1.3E-76   7.6E-96   4.8E+228
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       1.2E-95   1.2E+214  1.3E+219  4.2E-62   3.4E-53   3.2E+160  1.8E-152
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Table 55: Learned policy for E CO a by turn for 15 turn UrbanSim game using DQ-C,
Boltzmann.

Strategy  1         2         3         4         5         6         7         8
SKC       9.1E-316  9.5E-316  1.4E-94   1.3E-76   7.6E-96   4.8E+228  5.0E+223  2.2E-57
SKH       0.0E+00   2.5E+02   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       1.3E-76   7.6E-96   4.8E+228  5.0E+223  8.0E-72   9.9E-96   4.0E+252  2.3E+15
SNC       0.0E+00   0.0E+00   0.0E+00   2.6E+02   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       3.4E-53   3.2E+160  1.8E-152  2.6E+180  1.2E-76   1.1E-95   1.2E+214  1.3E+219
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       2.6E+180  7.6E-96   1.1E-42   2.9E+161  1.4E+219  8.1E-72   1.3E-76   7.6E-96
MKH       0.0E+00   0.0E+00   2.6E+02   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       9.3E+252  9.1E+227  1.4E-94   1.3E-76   7.6E-96   4.8E+228  5.0E+223  9.6E-48
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   2.7E+02   0.0E+00
MNH       6.0E+175  9.7E-72   4.1E-38   9.3E+252  9.1E+227  1.4E-94   1.0E-42   6.8E-38
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Strategy  9         10        11        12        13        14        15
SKC       4.1E-38   9.3E+252  9.1E+227  1.4E-94   1.3E-76   7.6E-96   4.8E+228
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       1.1E+243  5.2E-67   3.2E-57   4.1E-38   9.3E+252  9.1E+227  1.4E-94
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   3.0E+02   3.0E+02   0.0E+00
SNH       9.7E-72   3.4E-53   3.2E+160  1.8E-152  2.6E+180  1.2E-76   1.1E-95
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       4.8E+228  5.0E+223  1.2E-76   4.1E-38   9.3E+252  9.1E+227  1.4E-94
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       4.1E-38   9.3E+252  9.1E+227  1.4E-94   1.3E-76   7.6E-96   4.8E+228
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       1.2E-95   1.2E+214  1.3E+219  4.2E-62   3.4E-53   3.2E+160  1.8E-152
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Table 56: Learned policy for E CO b by turn for 15 turn UrbanSim game using DQ-C,
Boltzmann.

Strategy  1         2         3         4         5         6         7         8
SKC       9.1E-316  9.5E-316  1.4E-94   1.3E-76   7.6E-96   4.8E+228  5.0E+223  2.2E-57
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       1.3E-76   7.6E-96   4.8E+228  5.0E+223  8.0E-72   9.9E-96   4.0E+252  2.3E+15
SNC       0.0E+00   0.0E+00   0.0E+00   2.6E+02   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       3.4E-53   3.2E+160  1.8E-152  2.6E+180  1.2E-76   1.1E-95   1.2E+214  1.3E+219
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   2.7E+02   0.0E+00
MKC       2.6E+180  7.6E-96   1.1E-42   2.9E+161  1.4E+219  8.1E-72   1.3E-76   7.6E-96
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   2.6E+02   0.0E+00   0.0E+00   0.0E+00
MKB       9.3E+252  9.1E+227  1.4E-94   1.3E-76   7.6E-96   4.8E+228  5.0E+223  9.6E-48
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       6.0E+175  9.7E-72   4.1E-38   9.3E+252  9.1E+227  1.4E-94   1.0E-42   6.8E-38
MNB       2.5E+02   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Strategy  9         10        11        12        13        14        15
SKC       4.1E-38   9.3E+252  9.1E+227  1.4E-94   1.3E-76   7.6E-96   4.8E+228
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       1.1E+243  5.2E-67   3.2E-57   4.1E-38   9.3E+252  9.1E+227  1.4E-94
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   3.1E+02
SNH       9.7E-72   3.4E-53   3.2E+160  1.8E-152  2.6E+180  1.2E-76   1.1E-95
SNB       0.0E+00   2.8E+02   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       4.8E+228  5.0E+223  1.2E-76   4.1E-38   9.3E+252  9.1E+227  1.4E-94
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       4.1E-38   9.3E+252  9.1E+227  1.4E-94   1.3E-76   7.6E-96   4.8E+228
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   3.0E+02   0.0E+00
MNH       1.2E-95   1.2E+214  1.3E+219  4.2E-62   3.4E-53   3.2E+160  1.8E-152
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Table 57: Learned policy for F CO a by turn for 15 turn UrbanSim game using DQ-C,
Boltzmann.

Strategy  1         2         3         4         5         6         7         8
SKC       6.4E-316  9.5E-316  1.4E-94   1.3E-76   7.6E-96   4.8E+228  5.0E+223  2.2E-57
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       1.3E-76   7.6E-96   4.8E+228  5.0E+223  8.0E-72   9.9E-96   4.0E+252  2.3E+15
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       3.4E-53   3.2E+160  1.8E-152  2.6E+180  1.2E-76   1.1E-95   1.2E+214  1.3E+219
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   2.6E+02   0.0E+00   0.0E+00   0.0E+00
MKC       2.6E+180  7.6E-96   1.1E-42   2.9E+161  1.4E+219  8.1E-72   1.3E-76   7.6E-96
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   2.7E+02
MKB       9.3E+252  9.1E+227  1.4E-94   1.3E-76   7.6E-96   4.8E+228  5.0E+223  9.6E-48
MNC       0.0E+00   2.5E+02   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       6.0E+175  9.7E-72   4.1E-38   9.3E+252  9.1E+227  1.4E-94   1.0E-42   6.8E-38
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Strategy  9         10        11        12        13        14        15
SKC       4.1E-38   9.3E+252  9.1E+227  1.4E-94   1.3E-76   7.6E-96   4.8E+228
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       1.1E+243  5.2E-67   3.2E-57   4.1E-38   9.3E+252  9.1E+227  1.4E-94
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   3.0E+02   0.0E+00   0.0E+00
SNH       9.7E-72   3.4E-53   3.2E+160  1.8E-152  2.6E+180  1.2E-76   1.1E-95
SNB       2.8E+02   2.8E+02   2.9E+02   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       4.8E+228  5.0E+223  1.2E-76   4.1E-38   9.3E+252  9.1E+227  1.4E-94
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       4.1E-38   9.3E+252  9.1E+227  1.4E-94   1.3E-76   7.6E-96   4.8E+228
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       1.2E-95   1.2E+214  1.3E+219  4.2E-62   3.4E-53   3.2E+160  1.8E-152
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   3.1E+02

Table 58: Learned policy for F CO b by turn for 15 turn UrbanSim game using DQ-C,
Boltzmann.

Strategy  1         2         3         4         5         6         7         8
SKC       6.4E-316  9.5E-316  1.4E-94   1.3E-76   7.6E-96   4.8E+228  5.0E+223  2.2E-57
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       1.3E-76   7.6E-96   4.8E+228  5.0E+223  8.0E-72   9.9E-96   4.0E+252  2.3E+15
SNC       0.0E+00   0.0E+00   0.0E+00   2.6E+02   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       3.4E-53   3.2E+160  1.8E-152  2.6E+180  1.2E-76   1.1E-95   1.2E+214  1.3E+219
SNB       0.0E+00   0.0E+00   2.6E+02   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       2.6E+180  7.6E-96   1.1E-42   2.9E+161  1.4E+219  8.1E-72   1.3E-76   7.6E-96
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   2.6E+02   0.0E+00   2.7E+02   2.7E+02
MKB       9.3E+252  9.1E+227  1.4E-94   1.3E-76   7.6E-96   4.8E+228  5.0E+223  9.6E-48
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       6.0E+175  9.7E-72   4.1E-38   9.3E+252  9.1E+227  1.4E-94   1.0E-42   6.8E-38
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Strategy  9         10        11        12        13        14        15
SKC       4.1E-38   9.3E+252  9.1E+227  1.4E-94   1.3E-76   7.6E-96   4.8E+228
SKH       2.8E+02   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       1.1E+243  5.2E-67   3.2E-57   4.1E-38   9.3E+252  9.1E+227  1.4E-94
SNC       0.0E+00   0.0E+00   0.0E+00   3.0E+02   0.0E+00   0.0E+00   0.0E+00
SNH       9.7E-72   3.4E-53   3.2E+160  1.8E-152  2.6E+180  1.2E-76   1.1E-95
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       4.8E+228  5.0E+223  1.2E-76   4.1E-38   9.3E+252  9.1E+227  1.4E-94
MKH       0.0E+00   2.8E+02   0.0E+00   0.0E+00   0.0E+00   0.0E+00   3.1E+02
MKB       4.1E-38   9.3E+252  9.1E+227  1.4E-94   1.3E-76   7.6E-96   4.8E+228
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       1.2E-95   1.2E+214  1.3E+219  4.2E-62   3.4E-53   3.2E+160  1.8E-152
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Table 59: Learned policy for G CO a by turn for 15 turn UrbanSim game using DQ-C,
Boltzmann.

Strategy  1         2         3         4         5         6         7         8
SKC       6.4E-316  9.5E-316  1.4E-94   1.3E-76   7.6E-96   4.8E+228  5.0E+223  2.2E-57
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   2.7E+02   0.0E+00
SKB       1.3E-76   7.6E-96   4.8E+228  5.0E+223  8.0E-72   9.9E-96   4.0E+252  2.3E+15
SNC       0.0E+00   0.0E+00   2.6E+02   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       3.4E-53   3.2E+160  1.8E-152  2.6E+180  1.2E-76   1.1E-95   1.2E+214  1.3E+219
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       2.6E+180  7.6E-96   1.1E-42   2.9E+161  1.4E+219  8.1E-72   1.3E-76   7.6E-96
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       9.3E+252  9.1E+227  1.4E-94   1.3E-76   7.6E-96   4.8E+228  5.0E+223  9.6E-48
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   2.6E+02   0.0E+00   0.0E+00
MNH       6.0E+175  9.7E-72   4.1E-38   9.3E+252  9.1E+227  1.4E-94   1.0E-42   6.8E-38
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   2.7E+02

Strategy  9         10        11        12        13        14        15
SKC       4.1E-38   9.3E+252  9.1E+227  1.4E-94   1.3E-76   7.6E-96   4.8E+228
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       1.1E+243  5.2E-67   3.2E-57   4.1E-38   9.3E+252  9.1E+227  1.4E-94
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   3.1E+02
SNH       9.7E-72   3.4E-53   3.2E+160  1.8E-152  2.6E+180  1.2E-76   1.1E-95
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       4.8E+228  5.0E+223  1.2E-76   4.1E-38   9.3E+252  9.1E+227  1.4E-94
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       4.1E-38   9.3E+252  9.1E+227  1.4E-94   1.3E-76   7.6E-96   4.8E+228
MNC       0.0E+00   0.0E+00   0.0E+00   3.0E+02   0.0E+00   0.0E+00   0.0E+00
MNH       1.2E-95   1.2E+214  1.3E+219  4.2E-62   3.4E-53   3.2E+160  1.8E-152
MNB       0.0E+00   2.8E+02   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Table 60: Learned policy for G CO b by turn for 15 turn UrbanSim game using DQ-C,
Boltzmann.

Strategy  1         2         3         4         5         6         7         8
SKC       6.4E-316  9.5E-316  1.4E-94   1.3E-76   7.6E-96   4.8E+228  5.0E+223  2.2E-57
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       1.3E-76   7.6E-96   4.8E+228  5.0E+223  8.0E-72   9.9E-96   4.0E+252  2.3E+15
SNC       0.0E+00   2.5E+02   0.0E+00   2.6E+02   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       3.4E-53   3.2E+160  1.8E-152  2.6E+180  1.2E-76   1.1E-95   1.2E+214  1.3E+219
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       2.6E+180  7.6E-96   1.1E-42   2.9E+161  1.4E+219  8.1E-72   1.3E-76   7.6E-96
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       9.3E+252  9.1E+227  1.4E-94   1.3E-76   7.6E-96   4.8E+228  5.0E+223  9.6E-48
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       6.0E+175  9.7E-72   4.1E-38   9.3E+252  9.1E+227  1.4E-94   1.0E-42   6.8E-38
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Strategy  9         10        11        12        13        14        15
SKC       4.1E-38   9.3E+252  9.1E+227  1.4E-94   1.3E-76   7.6E-96   4.8E+228
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       1.1E+243  5.2E-67   3.2E-57   4.1E-38   9.3E+252  9.1E+227  1.4E-94
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       9.7E-72   3.4E-53   3.2E+160  1.8E-152  2.6E+180  1.2E-76   1.1E-95
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       4.8E+228  5.0E+223  1.2E-76   4.1E-38   9.3E+252  9.1E+227  1.4E-94
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       4.1E-38   9.3E+252  9.1E+227  1.4E-94   1.3E-76   7.6E-96   4.8E+228
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       1.2E-95   1.2E+214  1.3E+219  4.2E-62   3.4E-53   3.2E+160  1.8E-152
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Table 61: Learned policy for H CO a by turn for 15 turn UrbanSim game using DQ-C,
Boltzmann.

Strategy  1         2         3         4         5         6         7         8
SKC       6.4E-316  9.5E-316  1.4E-94   1.3E-76   7.6E-96   4.8E+228  5.0E+223  2.2E-57
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       1.3E-76   7.6E-96   4.8E+228  5.0E+223  8.0E-72   9.9E-96   4.0E+252  2.3E+15
SNC       0.0E+00   2.5E+02   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       3.4E-53   3.2E+160  1.8E-152  2.6E+180  1.2E-76   1.1E-95   1.2E+214  1.3E+219
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       2.6E+180  7.6E-96   1.1E-42   2.9E+161  1.4E+219  8.1E-72   1.3E-76   7.6E-96
MKH       2.5E+02   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       9.3E+252  9.1E+227  1.4E-94   1.3E-76   7.6E-96   4.8E+228  5.0E+223  9.6E-48
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       6.0E+175  9.7E-72   4.1E-38   9.3E+252  9.1E+227  1.4E-94   1.0E-42   6.8E-38
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   2.6E+02   0.0E+00   0.0E+00   2.7E+02

Strategy  9         10        11        12        13        14        15
SKC       4.1E-38   9.3E+252  9.1E+227  1.4E-94   1.3E-76   7.6E-96   4.8E+228
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       1.1E+243  5.2E-67   3.2E-57   4.1E-38   9.3E+252  9.1E+227  1.4E-94
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       9.7E-72   3.4E-53   3.2E+160  1.8E-152  2.6E+180  1.2E-76   1.1E-95
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       4.8E+228  5.0E+223  1.2E-76   4.1E-38   9.3E+252  9.1E+227  1.4E-94
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       4.1E-38   9.3E+252  9.1E+227  1.4E-94   1.3E-76   7.6E-96   4.8E+228
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       1.2E-95   1.2E+214  1.3E+219  4.2E-62   3.4E-53   3.2E+160  1.8E-152
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Table 62: Learned policy for H CO b by turn for 15 turn UrbanSim game using DQ-C,
Boltzmann.

Strategy  1         2         3         4         5         6         7         8
SKC       6.4E-316  9.5E-316  1.4E-94   1.3E-76   7.6E-96   4.8E+228  5.0E+223  2.2E-57
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   2.6E+02   0.0E+00   0.0E+00   0.0E+00
SKB       1.3E-76   7.6E-96   4.8E+228  5.0E+223  8.0E-72   9.9E-96   4.0E+252  2.3E+15
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       3.4E-53   3.2E+160  1.8E-152  2.6E+180  1.2E-76   1.1E-95   1.2E+214  1.3E+219
SNB       2.5E+02   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       2.6E+180  7.6E-96   1.1E-42   2.9E+161  1.4E+219  8.1E-72   1.3E-76   7.6E-96
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       9.3E+252  9.1E+227  1.4E-94   1.3E-76   7.6E-96   4.8E+228  5.0E+223  9.6E-48
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   2.7E+02
MNH       6.0E+175  9.7E-72   4.1E-38   9.3E+252  9.1E+227  1.4E-94   1.0E-42   6.8E-38
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   2.7E+02   0.0E+00

Strategy  9         10        11        12        13        14        15
SKC       4.1E-38   9.3E+252  9.1E+227  1.4E-94   1.3E-76   7.6E-96   4.8E+228
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   3.1E+02
SKB       1.1E+243  5.2E-67   3.2E-57   4.1E-38   9.3E+252  9.1E+227  1.4E-94
SNC       0.0E+00   0.0E+00   2.9E+02   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       9.7E-72   3.4E-53   3.2E+160  1.8E-152  2.6E+180  1.2E-76   1.1E-95
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   3.0E+02   0.0E+00   0.0E+00
MKC       4.8E+228  5.0E+223  1.2E-76   4.1E-38   9.3E+252  9.1E+227  1.4E-94
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       4.1E-38   9.3E+252  9.1E+227  1.4E-94   1.3E-76   7.6E-96   4.8E+228
MNC       2.8E+02   0.0E+00   0.0E+00   0.0E+00   0.0E+00   3.0E+02   0.0E+00
MNH       1.2E-95   1.2E+214  1.3E+219  4.2E-62   3.4E-53   3.2E+160  1.8E-152
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Table 63: Learned policy for QRF by turn for 15 turn UrbanSim game using DQ-C,
Boltzmann.

Strategy  1         2         3         4         5         6         7         8
SKC       6.4E-316  9.5E-316  1.4E-94   1.3E-76   7.6E-96   4.8E+228  5.0E+223  2.2E-57
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       1.3E-76   7.6E-96   4.8E+228  5.0E+223  8.0E-72   9.9E-96   4.0E+252  2.3E+15
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       3.4E-53   3.2E+160  1.8E-152  2.6E+180  1.2E-76   1.1E-95   1.2E+214  1.3E+219
SNB       2.5E+02   0.0E+00   2.6E+02   0.0E+00   2.6E+02   0.0E+00   2.7E+02   0.0E+00
MKC       2.6E+180  7.6E-96   1.1E-42   2.9E+161  1.4E+219  8.1E-72   1.3E-76   7.6E-96
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       9.3E+252  9.1E+227  1.4E-94   1.3E-76   7.6E-96   4.8E+228  5.0E+223  9.6E-48
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   2.6E+02   0.0E+00   2.7E+02
MNH       6.0E+175  9.7E-72   4.1E-38   9.3E+252  9.1E+227  1.4E-94   1.0E-42   6.8E-38
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Strategy  9         10        11        12        13        14        15
SKC       4.1E-38   9.3E+252  9.1E+227  1.4E-94   1.3E-76   7.6E-96   4.8E+228
SKH       2.8E+02   0.0E+00   0.0E+00   3.0E+02   0.0E+00   0.0E+00   0.0E+00
SKB       1.1E+243  5.2E-67   3.2E-57   4.1E-38   9.3E+252  9.1E+227  1.4E-94
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   3.0E+02   0.0E+00
SNH       9.7E-72   3.4E-53   3.2E+160  1.8E-152  2.6E+180  1.2E-76   1.1E-95
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       4.8E+228  5.0E+223  1.2E-76   4.1E-38   9.3E+252  9.1E+227  1.4E-94
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   3.0E+02   0.0E+00   0.0E+00
MKB       4.1E-38   9.3E+252  9.1E+227  1.4E-94   1.3E-76   7.6E-96   4.8E+228
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       1.2E-95   1.2E+214  1.3E+219  4.2E-62   3.4E-53   3.2E+160  1.8E-152
MNB       0.0E+00   0.0E+00   2.9E+02   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Table 64: Learned policy for Battalion Commander by turn for 15 turn UrbanSim game
using Q(λ), Boltzmann.

Strategy  1         2         3         4         5         6         7         8
SKC       6.3E-316  4.7E-316  1.2E+166  2.4E+232  1.4E-303  2.0E-316  6.0E-154  1.1E-152
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       1.8E-152  1.7E+214  2.5E-154  1.4E+219  1.9E+219  4.7E+252  7.7E+170  7.3E+223
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       5.3E+228  3.0E+222  1.2E-259  2.1E-312  1.1E-311  2.7E-308  2.0E-309  2.1E-301
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       4.6E+228  1.3E-152  2.0E+161  3.3E+257  2.5E+198  1.7E-152  1.2E+214  4.0E+252
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       4.5E+257  5.8E+180  5.8E+252  1.8E+190  1.3E+213  5.9E+199  2.5E-154  1.7E+214
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       1.8E-313  1.4E-308  9.9E+86   7.2E-311  7.1E-235  3.5E-313  2.5E-308  9.9E+86
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Strategy  9         10        11        12        13        14        15
SKC       3.0E+180  5.0E+175  2.2E+161  9.8E+199  4.7E+257  1.9E+227  8.0E+165
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       4.7E+164  2.1E+214  3.0E-85   6.0E-154  2.0E+267  1.7E+243  2.3E+243
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       1.6E-309  6.7E-314  6.2E-309  1.8E+175  4.6E-314  9.9E-307  5.1E-116
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       3.0E-85   1.7E+262  3.1E+169  1.9E+214  1.8E+185  5.8E+252  2.0E+161
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       1.8E-152  6.0E+247  5.8E+180  5.3E-85   3.4E-309  6.0E+197  3.0E-311
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       1.2E-310  2.4E-196  5.2E-313  3.7E-308  9.9E+86   1.6E-310  8.2E-158
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Table 65: Learned policy for CA unit by turn for 15 turn UrbanSim game using Q(λ),
Boltzmann.

Strategy  1         2         3         4         5         6         7         8
SKC       6.3E-316  4.7E-316  1.2E+166  2.4E+232  1.4E-303  2.0E-316  6.0E-154  1.1E-152
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       1.8E-152  1.7E+214  2.5E-154  1.4E+219  1.9E+219  4.7E+252  7.7E+170  7.3E+223
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       5.3E+228  3.0E+222  1.2E-259  2.1E-312  1.1E-311  2.7E-308  2.0E-309  2.1E-301
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       4.6E+228  1.3E-152  2.0E+161  3.3E+257  2.5E+198  1.7E-152  1.2E+214  4.0E+252
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       4.5E+257  5.8E+180  5.8E+252  1.8E+190  1.3E+213  5.9E+199  2.5E-154  1.7E+214
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       1.8E-313  1.4E-308  9.9E+86   7.2E-311  7.1E-235  3.5E-313  2.5E-308  9.9E+86
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Strategy  9         10        11        12        13        14        15
SKC       3.0E+180  5.0E+175  2.2E+161  9.8E+199  4.7E+257  1.9E+227  8.0E+165
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       4.7E+164  2.1E+214  3.0E-85   6.0E-154  2.0E+267  1.7E+243  2.3E+243
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       1.6E-309  6.7E-314  6.2E-309  1.8E+175  4.6E-314  9.9E-307  5.1E-116
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       3.0E-85   1.7E+262  3.1E+169  1.9E+214  1.8E+185  5.8E+252  2.0E+161
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       1.8E-152  6.0E+247  5.8E+180  5.3E-85   3.4E-309  6.0E+197  3.0E-311
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       1.2E-310  2.4E-196  5.2E-313  3.7E-308  9.9E+86   1.6E-310  8.2E-158
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Table 66: Learned policy for E CO a by turn for 15 turn UrbanSim game using Q(λ),
Boltzmann.

Strategy  1         2         3         4         5         6         7         8
SKC       6.3E-316  4.7E-316  1.2E+166  2.4E+232  1.4E-303  2.0E-316  6.0E-154  1.1E-152
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       1.8E-152  1.7E+214  2.5E-154  1.4E+219  1.9E+219  4.7E+252  7.7E+170  7.3E+223
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       5.3E+228  3.0E+222  1.2E-259  2.1E-312  1.1E-311  2.7E-308  2.0E-309  2.1E-301
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       4.6E+228  1.3E-152  2.0E+161  3.3E+257  2.5E+198  1.7E-152  1.2E+214  4.0E+252
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       4.5E+257  5.8E+180  5.8E+252  1.8E+190  1.3E+213  5.9E+199  2.5E-154  1.7E+214
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       1.8E-313  1.4E-308  9.9E+86   7.2E-311  7.1E-235  3.5E-313  2.5E-308  9.9E+86
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Strategy  9         10        11        12        13        14        15
SKC       3.0E+180  5.0E+175  2.2E+161  9.8E+199  4.7E+257  1.9E+227  8.0E+165
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       4.7E+164  2.1E+214  3.0E-85   6.0E-154  2.0E+267  1.7E+243  2.3E+243
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       1.6E-309  6.7E-314  6.2E-309  1.8E+175  4.6E-314  9.9E-307  5.1E-116
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       3.0E-85   1.7E+262  3.1E+169  1.9E+214  1.8E+185  5.8E+252  2.0E+161
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       1.8E-152  6.0E+247  5.8E+180  5.3E-85   3.4E-309  6.0E+197  3.0E-311
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       1.2E-310  2.4E-196  5.2E-313  3.7E-308  9.9E+86   1.6E-310  8.2E-158
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Table 67: Learned policy for F CO a by turn for 15 turn UrbanSim game using Q(λ),
Boltzmann.

Strategy  1         2         3         4         5         6         7         8
SKC       6.3E-316  4.7E-316  1.2E+166  2.4E+232  1.4E-303  2.0E-316  6.0E-154  1.1E-152
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       1.8E-152  1.7E+214  2.5E-154  1.4E+219  1.9E+219  4.7E+252  7.7E+170  7.3E+223
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       5.3E+228  3.0E+222  1.2E-259  2.1E-312  1.1E-311  2.7E-308  2.0E-309  2.1E-301
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       4.6E+228  1.3E-152  2.0E+161  3.3E+257  2.5E+198  1.7E-152  1.2E+214  4.0E+252
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       4.5E+257  5.8E+180  5.8E+252  1.8E+190  1.3E+213  5.9E+199  2.5E-154  1.7E+214
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       1.8E-313  1.4E-308  9.9E+86   7.2E-311  7.1E-235  3.5E-313  2.5E-308  9.9E+86
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Strategy  9         10        11        12        13        14        15
SKC       3.0E+180  5.0E+175  2.2E+161  9.8E+199  4.7E+257  1.9E+227  8.0E+165
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       4.7E+164  2.1E+214  3.0E-85   6.0E-154  2.0E+267  1.7E+243  2.3E+243
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       1.6E-309  6.7E-314  6.2E-309  1.8E+175  4.6E-314  9.9E-307  5.1E-116
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       3.0E-85   1.7E+262  3.1E+169  1.9E+214  1.8E+185  5.8E+252  2.0E+161
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       1.8E-152  6.0E+247  5.8E+180  5.3E-85   3.4E-309  6.0E+197  3.0E-311
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       1.2E-310  2.4E-196  5.2E-313  3.7E-308  9.9E+86   1.6E-310  8.2E-158
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Table 68: Learned policy for E CO b by turn for 15 turn UrbanSim game using Q(λ),
Boltzmann.

Strategy  1         2         3         4         5         6         7         8
SKC       6.3E-316  6.4E-316  1.2E+166  2.4E+232  1.4E-303  2.0E-316  6.0E-154  1.1E-152
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       1.8E-152  1.7E+214  2.5E-154  1.4E+219  1.9E+219  4.7E+252  7.7E+170  7.3E+223
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       5.3E+228  3.0E+222  1.2E-259  2.1E-312  1.1E-311  2.7E-308  2.0E-309  2.1E-301
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       4.6E+228  1.3E-152  2.0E+161  3.3E+257  2.5E+198  1.7E-152  1.2E+214  4.0E+252
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       4.5E+257  5.8E+180  5.8E+252  1.8E+190  1.3E+213  5.9E+199  2.5E-154  1.7E+214
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       1.8E-313  1.4E-308  9.9E+86   7.2E-311  7.1E-235  3.5E-313  2.5E-308  9.9E+86
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Strategy  9         10        11        12        13        14        15
SKC       3.0E+180  5.0E+175  2.2E+161  9.8E+199  4.7E+257  1.9E+227  8.0E+165
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       4.7E+164  2.1E+214  3.0E-85   6.0E-154  2.0E+267  1.7E+243  2.3E+243
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       1.6E-309  6.7E-314  6.2E-309  1.8E+175  4.6E-314  9.9E-307  5.1E-116
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       3.0E-85   1.7E+262  3.1E+169  1.9E+214  1.8E+185  5.8E+252  2.0E+161
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       1.8E-152  6.0E+247  5.8E+180  5.3E-85   3.4E-309  6.0E+197  3.0E-311
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       1.2E-310  2.4E-196  5.2E-313  3.7E-308  9.9E+86   1.6E-310  8.2E-158
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

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Table 69: Learned policy for F CO b by turn for 15 turn UrbanSim game using Q(A), 
Boltzmann. 


Strategy  1         2         3         4         5         6         7         8
SKC       6.3E-316  4.7E-316  1.2E+166  2.4E+232  1.4E-303  2.0E-316  6.0E-154  1.1E-152
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       1.8E-152  1.7E+214  2.5E-154  1.4E+219  1.9E+219  4.7E+252  7.7E+170  7.3E+223
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       5.3E+228  3.0E+222  1.2E-259  2.1E-312  1.1E-311  2.7E-308  2.0E-309  2.1E-301
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       4.6E+228  1.3E-152  2.0E+161  3.3E+257  2.5E+198  1.7E-152  1.2E+214  4.0E+252
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       4.5E+257  5.8E+180  5.8E+252  1.8E+190  1.3E+213  5.9E+199  2.5E-154  1.7E+214
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       1.8E-313  1.4E-308  9.9E+86   7.2E-311  7.1E-235  3.5E-313  2.5E-308  9.9E+86
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Strategy  9         10        11        12        13        14        15
SKC       3.0E+180  5.0E+175  2.2E+161  9.8E+199  4.7E+257  1.9E+227  8.0E+165
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       4.7E+164  2.1E+214  3.0E-85   6.0E-154  2.0E+267  1.7E+243  2.3E+243
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       1.6E-309  6.7E-314  6.2E-309  1.8E+175  4.6E-314  9.9E-307  5.1E-116
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       3.0E-85   1.7E+262  3.1E+169  1.9E+214  1.8E+185  5.8E+252  2.0E+161
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       1.8E-152  6.0E+247  5.8E+180  5.3E-85   3.4E-309  6.0E+197  3.0E-311
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       1.2E-310  2.4E-196  5.2E-313  3.7E-308  9.9E+86   1.6E-310  8.2E-158
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00


Table 70: Learned policy for G CO b by turn for 15 turn UrbanSim game using Q(λ),
Boltzmann.


Strategy  1         2         3         4         5         6         7         8
SKC       6.3E-316  4.7E-316  1.2E+166  2.4E+232  1.4E-303  2.0E-316  6.0E-154  1.1E-152
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       1.8E-152  1.7E+214  2.5E-154  1.4E+219  1.9E+219  4.7E+252  7.7E+170  7.3E+223
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       5.3E+228  3.0E+222  1.2E-259  2.1E-312  1.1E-311  2.7E-308  2.0E-309  2.1E-301
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       4.6E+228  1.3E-152  2.0E+161  3.3E+257  2.5E+198  1.7E-152  1.2E+214  4.0E+252
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       4.5E+257  5.8E+180  5.8E+252  1.8E+190  1.3E+213  5.9E+199  2.5E-154  1.7E+214
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       1.8E-313  1.4E-308  9.9E+86   7.2E-311  7.1E-235  3.5E-313  2.5E-308  9.9E+86
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Strategy  9         10        11        12        13        14        15
SKC       3.0E+180  5.0E+175  2.2E+161  9.8E+199  4.7E+257  1.9E+227  8.0E+165
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       4.7E+164  2.1E+214  3.0E-85   6.0E-154  2.0E+267  1.7E+243  2.3E+243
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       1.6E-309  6.7E-314  6.2E-309  1.8E+175  4.6E-314  9.9E-307  5.1E-116
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       3.0E-85   1.7E+262  3.1E+169  1.9E+214  1.8E+185  5.8E+252  2.0E+161
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       1.8E-152  6.0E+247  5.8E+180  5.3E-85   3.4E-309  6.0E+197  3.0E-311
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       1.2E-310  2.4E-196  5.2E-313  3.7E-308  9.9E+86   1.6E-310  8.2E-158
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00


Table 71: Learned policy for G CO a by turn for 15 turn UrbanSim game using Q(λ),
Boltzmann.


Strategy  1         2         3         4         5         6         7         8
SKC       6.3E-316  4.7E-316  1.2E+166  2.4E+232  1.4E-303  2.0E-316  6.0E-154  1.1E-152
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       1.8E-152  1.7E+214  2.5E-154  1.4E+219  1.9E+219  4.7E+252  7.7E+170  7.3E+223
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       5.3E+228  3.0E+222  1.2E-259  2.1E-312  1.1E-311  2.7E-308  2.0E-309  2.1E-301
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       4.6E+228  1.3E-152  2.0E+161  3.3E+257  2.5E+198  1.7E-152  1.2E+214  4.0E+252
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       4.5E+257  5.8E+180  5.8E+252  1.8E+190  1.3E+213  5.9E+199  2.5E-154  1.7E+214
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       1.8E-313  1.4E-308  9.9E+86   7.2E-311  7.1E-235  3.5E-313  2.5E-308  9.9E+86
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Strategy  9         10        11        12        13        14        15
SKC       3.0E+180  5.0E+175  2.2E+161  9.8E+199  4.7E+257  1.9E+227  8.0E+165
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       4.7E+164  2.1E+214  3.0E-85   6.0E-154  2.0E+267  1.7E+243  2.3E+243
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       1.6E-309  6.7E-314  6.2E-309  1.8E+175  4.6E-314  9.9E-307  5.1E-116
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       3.0E-85   1.7E+262  3.1E+169  1.9E+214  1.8E+185  5.8E+252  2.0E+161
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       1.8E-152  6.0E+247  5.8E+180  5.3E-85   3.4E-309  6.0E+197  3.0E-311
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       1.2E-310  2.4E-196  5.2E-313  3.7E-308  9.9E+86   1.6E-310  8.2E-158
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00


Table 72: Learned policy for H CO a by turn for 15 turn UrbanSim game using Q(λ),
Boltzmann.


Strategy  1         2         3         4         5         6         7         8
SKC       6.3E-316  6.4E-316  1.2E+166  2.4E+232  1.4E-303  2.0E-316  6.0E-154  1.1E-152
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       1.8E-152  1.7E+214  2.5E-154  1.4E+219  1.9E+219  4.7E+252  7.7E+170  7.3E+223
SNC       0.0E+00   2.8E+02   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       5.3E+228  3.0E+222  1.2E-259  2.1E-312  1.1E-311  2.7E-308  2.0E-309  2.1E-301
SNB       3.6E+02   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       4.6E+228  1.3E-152  2.0E+161  3.3E+257  2.5E+198  1.7E-152  1.2E+214  4.0E+252
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       4.5E+257  5.8E+180  5.8E+252  1.8E+190  1.3E+213  5.9E+199  2.5E-154  1.7E+214
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       1.8E-313  1.4E-308  9.9E+86   7.2E-311  7.1E-235  3.5E-313  2.5E-308  9.9E+86
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   2.7E+02   0.0E+00   0.0E+00   2.9E+02

Strategy  9         10        11        12        13        14        15
SKC       3.0E+180  5.0E+175  2.2E+161  9.8E+199  4.7E+257  1.9E+227  8.0E+165
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       4.7E+164  2.1E+214  3.0E-85   6.0E-154  2.0E+267  1.7E+243  2.3E+243
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       1.6E-309  6.7E-314  6.2E-309  1.8E+175  4.6E-314  9.9E-307  5.1E-116
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       3.0E-85   1.7E+262  3.1E+169  1.9E+214  1.8E+185  5.8E+252  2.0E+161
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       1.8E-152  6.0E+247  5.8E+180  5.3E-85   3.4E-309  6.0E+197  3.0E-311
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       1.2E-310  2.4E-196  5.2E-313  3.7E-308  9.9E+86   1.6E-310  8.2E-158
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00


Table 73: Learned policy for H CO b by turn for 15 turn UrbanSim game using Q(λ),
Boltzmann.


Strategy  1         2         3         4         5         6         7         8
SKC       6.3E-316  6.4E-316  1.2E+166  2.4E+232  1.4E-303  2.0E-316  6.0E-154  1.1E-152
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       1.8E-152  1.7E+214  2.5E-154  1.4E+219  1.9E+219  4.7E+252  7.7E+170  7.3E+223
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       5.3E+228  3.0E+222  1.2E-259  2.1E-312  1.1E-311  2.7E-308  2.0E-309  2.1E-301
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       4.6E+228  1.3E-152  2.0E+161  3.3E+257  2.5E+198  1.7E-152  1.2E+214  4.0E+252
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       4.5E+257  5.8E+180  5.8E+252  1.8E+190  1.3E+213  5.9E+199  2.5E-154  1.7E+214
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       1.8E-313  1.4E-308  9.9E+86   7.2E-311  7.1E-235  3.5E-313  2.5E-308  9.9E+86
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Strategy  9         10        11        12        13        14        15
SKC       3.0E+180  5.0E+175  2.2E+161  9.8E+199  4.7E+257  1.9E+227  8.0E+165
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       4.7E+164  2.1E+214  3.0E-85   6.0E-154  2.0E+267  1.7E+243  2.3E+243
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       1.6E-309  6.7E-314  6.2E-309  1.8E+175  4.6E-314  9.9E-307  5.1E-116
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       3.0E-85   1.7E+262  3.1E+169  1.9E+214  1.8E+185  5.8E+252  2.0E+161
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       1.8E-152  6.0E+247  5.8E+180  5.3E-85   3.4E-309  6.0E+197  3.0E-311
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       1.2E-310  2.4E-196  5.2E-313  3.7E-308  9.9E+86   1.6E-310  8.2E-158
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00


Table 74: Learned policy for QRF by turn for 15 turn UrbanSim game using Q(λ),
Boltzmann.


Strategy  1         2         3         4         5         6         7         8
SKC       6.3E-316  6.4E-316  1.2E+166  2.4E+232  1.4E-303  2.0E-316  6.0E-154  1.1E-152
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       1.8E-152  1.7E+214  2.5E-154  1.4E+219  1.9E+219  4.7E+252  7.7E+170  7.3E+223
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       5.3E+228  3.0E+222  1.2E-259  2.1E-312  1.1E-311  2.7E-308  2.0E-309  2.1E-301
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       4.6E+228  1.3E-152  2.0E+161  3.3E+257  2.5E+198  1.7E-152  1.2E+214  4.0E+252
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       4.5E+257  5.8E+180  5.8E+252  1.8E+190  1.3E+213  5.9E+199  2.5E-154  1.7E+214
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       1.8E-313  1.4E-308  9.9E+86   7.2E-311  7.1E-235  3.5E-313  2.5E-308  9.9E+86
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00

Strategy  9         10        11        12        13        14        15
SKC       3.0E+180  5.0E+175  2.2E+161  9.8E+199  4.7E+257  1.9E+227  8.0E+165
SKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SKB       4.7E+164  2.1E+214  3.0E-85   6.0E-154  2.0E+267  1.7E+243  2.3E+243
SNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
SNH       1.6E-309  6.7E-314  6.2E-309  1.8E+175  4.6E-314  9.9E-307  5.1E-116
SNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKC       3.0E-85   1.7E+262  3.1E+169  1.9E+214  1.8E+185  5.8E+252  2.0E+161
MKH       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MKB       1.8E-152  6.0E+247  5.8E+180  5.3E-85   3.4E-309  6.0E+197  3.0E-311
MNC       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00
MNH       1.2E-310  2.4E-196  5.2E-313  3.7E-308  9.9E+86   1.6E-310  8.2E-158
MNB       0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00   0.0E+00


APPENDIX H: STRATEGY LEVEL POLICIES BY ALGORITHM


URBANSIM 


Table 75: Learned policy for all agents by turn for 15 turn UrbanSim game using Q(λ),
ε-greedy.

Strategy             1    2    3    4    5    6    7    8    9    10   11   12   13   14   15
Battalion Commander  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB
CA UNIT              MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB
E CO A               MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB
E CO B               MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB
F CO A               MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB
F CO B               MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB
G CO A               MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB
G CO B               MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB
H CO A               SNH  SNH  SNB  SNB  SNH  MNB  MKB  SKH  MNB  MKB  SKC  SNB  SKH  SKB  SNC
H CO B               SNH  SNH  SNB  SNB  SNH  MNB  MKB  SKH  MNB  MKB  SKC  SNB  SKH  SKB  SNC
QRF                  SNH  SNH  SNB  SNB  SNH  MNB  MKB  SKH  MNB  MKB  SKC  SNB  SKH  SKB  SNC


Table 76: Learned policy for all agents by turn for 15 turn UrbanSim game using DQ-C,
ε-greedy.

Strategy             1    2    3    4    5    6    7    8    9    10   11   12   13   14   15
Battalion Commander  MKH  SNB  MKH  MNC  SKC  MNH  MNH  SNH  SKB  MNC  MKH  SNB  SNH  MNH  SNC
CA UNIT              SNC  MKH  MKH  MNH  MKB  MKB  SNC  SNH  MNC  SKH  SNH  MNH  SKC  MKH  SNH
E CO A               SKH  SKC  SKH  MKC  MNB  SKB  MNC  MNH  MNB  MKC  MNC  MKH  MNH  MKB  SKH
E CO B               SNB  MKB  SNC  MKC  SNH  SNB  MKB  SKB  SNC  SNC  SNB  SKH  MNH  MKB  MNB
F CO A               SNB  SKC  SNH  SKC  SKB  SKB  SNC  SNB  MNH  MNH  MNC  SNC  MNC  SKH  MNC
F CO B               SNH  SKH  SKC  MKB  SNC  SKC  MKB  SNC  SNC  MNC  MKB  SNC  SNH  MKC  MKB
G CO A               SNB  SKB  MNC  MKB  SNH  SKB  MNH  SNH  MKC  SNH  SKB  SKC  SKC  MNC  SNB
G CO B               SNB  MNC  SNB  MNH  SNH  MKC  SKC  MNH  SKH  MKC  SNB  SKC  SKC  SNH  SNH
H CO A               SKC  MNH  MNH  SNB  SKH  SKB  SKH  SNB  SKB  MKH  SKC  SNB  SKC  SKH  SNC
H CO B               SNH  MNB  SNB  MKB  SKH  MKB  SKC  MNC  SNB  SKC  SNB  MNB  SKH  SKH  SNH
QRF                  SNH  SKH  MNH  MKC  MNB  MKH  MNB  SKH  SKC  SNH  SKH  MNH  MKH  MNH  SKB


Table 77: Learned policy for all agents by turn for 15 turn UrbanSim game using DQ-C,
Boltzmann.

Strategy             1    2    3    4    5    6    7    8    9    10   11   12   13   14   15
Battalion Commander  MKH  MNH  MNH  SKC  MKH  MKB  MKB  MNB  SNH  SKB  MNB  SKC  MKB  SKC  SNC
CA UNIT              MKB  SNC  MNB  SKC  MKB  MKH  SNB  MNH  SNH  SKC  MNH  SNC  SKC  SNH  MNH
E CO A               SNH  SKH  MKH  SNC  SKB  MKC  MNC  SKB  MKC  MNH  MKB  MNH  SNC  SNC  MKB
E CO B               MNB  SNH  MKC  SNC  MKH  SKC  SNB  MNH  SNH  SNB  SKC  MNH  SKB  MNC  SNC
F CO A               MNH  MNC  SKC  MKC  SNB  SKB  SKC  MKH  SNB  SNB  SNB  SNH  SNC  MKC  MNB
F CO B               MNH  MKB  SNB  SNC  MKH  SNH  MKH  MKH  SKH  MKH  SNH  SNC  MKB  MKC  MKH
G CO A               SNH  MKB  SNC  MNH  SNH  MNC  SKH  MNB  SKC  MNB  MKC  MNC  SNH  SKB  SNC
G CO B               MNH  SNC  MKB  SNC  MKB  MKB  SKB  SNH  MKC  SNH  SNH  MKC  MKB  SKB  MKB
H CO A               MKH  SNC  SKC  MKC  MNB  MKC  SKC  MNB  MNH  MNH  SKB  SKB  MNH  SKB  MKB
H CO B               SNB  SKC  MKB  SKB  SKH  SKC  MNB  MNC  MNC  SKB  SNC  SKB  SNB  MNC  SKH
QRF                  SNB  SKB  SNB  SKB  SNB  MNC  SNB  MNC  SKH  SKB  MNB  SKH  MKH  SNC  SKB


Table 78: Learned policy for all agents by turn for 15 turn UrbanSim game using
SARSA(λ), ε-greedy.

Strategy             1    2    3    4    5    6    7    8    9    10   11   12   13   14   15
Battalion Commander  MNC  MNB  SKC  SNH  SKC  SNH  SNB  SNB  SKC  MKC  MNC  MNC  SNC  SNC  SNC
CA UNIT              MNC  MNB  SKC  SNH  MNC  SNH  SNB  SNC  SKC  SKH  MNC  MNC  SNC  SNC  SNC
E CO A               MNC  MNB  SKC  SNH  MNC  SNH  SNB  SNC  SKC  SKH  MNC  MNC  SNC  SNC  SNC
E CO B               SNH  MNC  SNC  MKH  MNH  SKC  SNH  SKB  MNC  SKH  MKH  MNC  SKH  SKC  SKB
F CO A               SNH  MNC  SNC  MKH  MNH  SKC  SNH  SKB  MNC  SKH  MKH  MNC  SKH  SKC  SKB
F CO B               SNH  MNC  SNC  MKH  MNH  SKC  SNH  SKB  MNC  SKH  MKH  MNC  SKH  SKC  SKB
G CO A               SNH  MNC  SNC  MKH  MNH  SKC  SNH  SKB  MNC  SKH  MKH  MNC  SKH  SKC  SKB
G CO B               SNH  MNC  SNC  MKH  MNH  SKC  SNH  SKB  MNC  SKH  MKH  MNC  SKH  SKC  SKB
H CO A               SNH  MNC  SNC  MKH  MNH  SKC  SNH  SKB  MNC  SKH  MKH  MNC  SKH  SKC  SKB
H CO B               SNH  MNC  SNC  MKH  MNH  SKC  SNH  SKB  MNC  SKH  MKH  MNC  SKH  SKC  SKB
QRF                  SNH  MNC  SNC  MKH  MNH  SKC  SNH  SKB  MNC  SKH  MKH  MNC  SKH  SKC  SKB


Table 79: Learned policy for all agents by turn for 15 turn UrbanSim game using
SARSA(λ), Boltzmann.

Strategy             1    2    3    4    5    6    7    8    9    10   11   12   13   14   15
Battalion Commander  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC
CA UNIT              SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC
E CO A               SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC
E CO B               SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC
F CO A               SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC
F CO B               SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC
G CO A               SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC
G CO B               SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC
H CO A               SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC
H CO B               SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC
QRF                  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC  SKC


Table 80: Learned policy for all agents by turn for 15 turn UrbanSim game using Q(λ),
Boltzmann.

Strategy             1    2    3    4    5    6    7    8    9    10   11   12   13   14   15
Battalion Commander  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB
CA UNIT              MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB
E CO A               MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB
E CO B               MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB
F CO A               MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB
F CO B               MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB
G CO A               MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB
G CO B               MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB  MNB
H CO A               SNB  SNC  SKC  MKC  MNB  MKC  SKC  MNB  MNH  MNH  SKB  SKB  MNH  SKB  MKB
H CO B               SNB  SNC  SKC  MKC  MNB  MKC  SKC  MNB  MNH  MNH  SKB  SKB  MNH  SKB  MKB
QRF                  SNB  SNC  SKC  MKC  MNB  MKC  SKC  MNB  MNH  MNH  SKB  SKB  MNH  SKB  MKB


APPENDIX I: ADDITIONAL BENCHMARKING RESULTS


This section provides additional insights into the performance of DQ-C in benchmarking
environments.
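
As a reminder of what the exploration settings varied in this appendix control (ε for ε-greedy, τ for Boltzmann exploration), the following is a minimal sketch of the two action-selection rules. It assumes the standard textbook definitions; the function names and signatures are illustrative only and are not taken from the code used to produce these results.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    best = max(q_values)
    # Break ties among equally valued greedy actions at random.
    return rng.choice([a for a, q in enumerate(q_values) if q == best])


def boltzmann_probabilities(q_values, tau):
    """Softmax over q / tau; the maximum is subtracted first for numerical stability."""
    top = max(q_values)
    weights = [math.exp((q - top) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def boltzmann(q_values, tau, rng=random):
    """Sample an action index in proportion to its Boltzmann probability."""
    probs = boltzmann_probabilities(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```

Under these definitions, a small τ concentrates the Boltzmann probabilities on the highest-valued action, while a small ε makes ε-greedy select the greedy action on most steps.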



Figure 54: Performance of DQ-C in 5x5 deterministic grid world over γ by α paired with
ε-greedy.

Figure 55: Performance of DQ-C in 5x5 deterministic grid world over γ by α paired with
Boltzmann exploration.

Figure 56: Performance of DQ-C in 5x5 deterministic grid world over γ by ε paired with
ε-greedy with α = 0.6.

Figure 57: Performance of DQ-C in 5x5 deterministic grid world over γ by τ paired with
Boltzmann exploration with α = 0.6.

Figure 58: Performance of SARSA(λ) in 5x5 deterministic grid world over γ by λ paired
with ε-greedy.

Figure 59: Performance of SARSA(λ) in 5x5 deterministic grid world over γ by λ paired
with Boltzmann exploration.

Figure 60: Performance of SARSA(λ) in 5x5 deterministic grid world over γ by α paired
with ε-greedy with α = 0.6.

Figure 61: Performance of SARSA(λ) in 5x5 deterministic grid world over γ by τ paired
with Boltzmann exploration with α = 0.6.

Figure 62: Performance of DQ-C in noisy 10-arm bandit over γ by α paired with ε-greedy.

Figure 63: Performance of DQ-C in noisy 10-arm bandit over γ by α paired with Boltzmann
exploration.

Figure 64: Performance of DQ-C in noisy 10-arm bandit over γ by τ paired with ε-greedy,
α = 0.6.


Figure 65: Performance of DQ-C in noisy 10-arm bandit over γ by τ paired with Boltzmann
exploration, α = 0.6.


Table 81: Parameter settings that produced the best observed values for noisy 10-arm bandit.

Algorithm-π          α    γ    ε or τ  λ
DQ-C, Boltzmann      0.9  0.1  0.1     NA
DQ-C, ε-greedy       0.9  0.6  0.1     NA
Q(λ), Boltzmann      0.2  0.8  0.1     0.4
Q(λ), ε-greedy       0.2  0.5  0.1     0.8
SARSA(λ), Boltzmann  0.9  0.5  0.3     0.8
SARSA(λ), ε-greedy   0.2  0.4  0.1     0.9


Table 82: Parameter settings that produced the best observed values for 5x5 Deterministic
Gridworld.

Algorithm-π          α    γ    ε or τ  λ
DQ-C, Boltzmann      0.1  0.5  0.1     NA
DQ-C, ε-greedy       0.1  0.4  0.1     NA
Q(λ), Boltzmann      0.9  0.5  0.4     0.3
Q(λ), ε-greedy       0.8  0.6  0.4     0.5
SARSA(λ), Boltzmann  0.6  0.9  0.1     0.9
SARSA(λ), ε-greedy   0.3  0.9  0.1     0.3


Table 83: Parameter settings that produced the best observed values for 5x5 Stochastic Gridworld.

Algorithm-π          α    γ    ε or τ  λ
DQ-C, Boltzmann      0.7  0.9  0.1     NA
DQ-C, ε-greedy       0.8  0.4  0.1     NA
Q(λ), Boltzmann      0.9  0.9  0.2     0.5
Q(λ), ε-greedy       0.5  0.5  0.3     0.5
SARSA(λ), Boltzmann  0.6  0.9  0.1     0.9
SARSA(λ), ε-greedy   0.4  0.9  0.1     0.9